Suppose that we observe entries or, more generally, linear combinations
of entries of an unknown $m \times T$ matrix $A$ corrupted by noise.
We are particularly interested in the high-dimensional setting where
the number $mT$ of unknown entries can be much larger than the sample
size $N$. Motivated by several applications, we consider estimation
of the matrix $A$ under the assumption that it has small rank. This can
be viewed as a dimension reduction or sparsity assumption. In order
to shrink toward a low-rank representation, we investigate penalized
least squares estimators with a Schatten-$p$ quasi-norm penalty term,
$p \le 1$. We study these estimators under two possible assumptions:
a modified version of the restricted isometry condition and a uniform
bound on the ratio "empirical norm induced by the sampling
operator/Frobenius norm." The main results are stated as nonasymptotic
upper bounds on the prediction risk and on the Schatten-$q$ risk of
the estimators, where $q \in [p, 2]$. The rates that we obtain for the
prediction risk are of the form $rm/N$ (for $m = T$), up to logarithmic
factors, where $r$ is the rank of $A$. The particular examples of
multi-task learning and matrix completion are worked out in detail. The
proofs are based on tools from the theory of empirical processes. As
a by-product, we derive bounds for the $k$th entropy numbers of the
quasi-convex Schatten class embeddings $S_p^M \hookrightarrow S_2^M$,
$p < 1$, which are of independent interest.

1. Introduction. Consider the observations $(X_i, Y_i)$ satisfying the model

(1.1) $Y_i = \operatorname{tr}(X_i' A^*) + \xi_i, \quad i = 1, \dots, N,$

where $X_i \in \mathbb{R}^{m \times T}$ are given matrices ($m$ rows, $T$ columns),
$A^* \in \mathbb{R}^{m \times T}$ is an unknown matrix, $\xi_i$ are i.i.d. random
errors, $\operatorname{tr}(B)$ denotes the trace of a square matrix $B$ and $X'$
stands for the transpose of $X$. Our aim is to estimate the matrix $A^*$
and to predict the future $Y$-values based on the sample $(X_i, Y_i)$,
$i = 1, \dots, N$.

We will call model (1.1) the trace regression model. Clearly, for T = 1 it 
reduces to the standard regression model. The "design" matrices Xi will 
be called masks. This name is motivated by the fact that we focus on the 
applications of trace regression where Xi are very sparse, that is, contain 
only a small percentage of nonzero entries. Therefore, multiplication of A* 
by Xi masks most of the entries of A*. The following two examples are of 
particular interest. 

(i) Point masks. For some, typically small, integer $d$ the point masks $X_i$
are defined as elements of the set

$\mathcal{X}_d = \Big\{ \sum_{j=1}^{d} e_{k_j}(m)\, e_{l_j}'(T) : 1 \le k_j \le m,\ 1 \le l_j \le T, \text{ with } (k_j, l_j) \ne (k_{j'}, l_{j'}) \text{ for } j \ne j' \Big\},$

where $e_k(m)$ are the canonical basis vectors of $\mathbb{R}^m$. In particular,
for $d = 1$ the point masks $X_i$ are matrices that have only one nonzero
entry, which equals 1. The problem of estimation of $A^*$ in this case
becomes the problem of matrix completion; the observations $Y_i$ are just
some selected entries of $A^*$ corrupted by noise, and the aim is to
reconstruct all the entries of $A^*$.
The problem of matrix completion dates back at least to Srebro, Rennie and 
Jaakkola (2005), Srebro and Shraibman (2005) and is mainly motivated by 
applications in recommendation systems. We will analyze the following two 
special cases of matrix completion: 

- USR (Uniform Sampling at Random) matrix completion. The masks $X_i$
are independent, uniformly distributed on

$\mathcal{X}_1 = \{ e_k(m)\, e_l'(T) : 1 \le k \le m,\ 1 \le l \le T \},$

and independent from $\xi_1, \dots, \xi_N$.
- Collaborative sampling (CS) matrix completion. The masks $X_i$ (random
or deterministic) belong to $\mathcal{X}_1$, are all distinct and independent
from $\xi_1, \dots, \xi_N$.

The CS matrix completion model is natural to describe recommendation
systems where every user rates every product only once. The USR matrix
completion can be used for transmission of a large-dimensional matrix through
a noisy communication channel; only a chosen small number of entries is
transmitted, and nevertheless the original matrix $A^*$ can be reconstructed
by the receiver. An important feature of the real-world matrix completion
problems is that the number of observed entries is much smaller than the
size of the matrix: $N \ll mT$, whereas $mT$ can be very large. For example,
$mT$ is of the order of hundreds of millions for the Netflix problem.
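
As an illustration only, the USR matrix completion model can be simulated in a few lines of Python/NumPy. The sketch below is not part of the original analysis; the dimensions, the noise level `sigma`, the seed and all function names are hypothetical choices. It draws $N < mT$ positions uniformly at random (with replacement), forms noisy observations as in (1.1), and checks on one observation that $\operatorname{tr}(X_i' A^*)$ with a point mask $X_i = e_k(m) e_l'(T)$ is just the entry $(k, l)$ of $A^*$.

```python
import numpy as np

def sample_usr_positions(m, T, N, rng):
    """Draw N positions (k, l) uniformly at random, with replacement."""
    rows = rng.integers(0, m, size=N)
    cols = rng.integers(0, T, size=N)
    return rows, cols

def point_mask(m, T, k, l):
    """Point mask e_k(m) e_l(T)': a single entry equal to 1."""
    X = np.zeros((m, T))
    X[k, l] = 1.0
    return X

rng = np.random.default_rng(0)           # hypothetical seed
m, T, r, N, sigma = 30, 20, 2, 200, 0.1  # hypothetical sizes, N < m * T

# Rank-r target matrix A* and noisy observations Y_i = tr(X_i' A*) + xi_i.
A_star = rng.standard_normal((m, r)) @ rng.standard_normal((r, T))
rows, cols = sample_usr_positions(m, T, N, rng)
xi = sigma * rng.standard_normal(N)
Y = A_star[rows, cols] + xi

# The trace with a point mask selects a single entry of A*.
X0 = point_mask(m, T, rows[0], cols[0])
trace_value = np.trace(X0.T @ A_star)
selected_entry = A_star[rows[0], cols[0]]
```

For CS matrix completion one would instead draw the positions without replacement, so that all masks are distinct.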

(ii) Column or row masks. If Xi has only a small number d of nonzero 
columns or rows, it is called column or row mask, respectively. We suppose 
here that d is much smaller than m and T. The remarkable case d = 1
covers the problem known in Statistics and Econometrics as longitudinal
(or panel, or cross-section) data analysis and in Machine Learning as multi- 
task learning. In what follows, we will designate this problem as multi-task 
learning, to avoid ambiguity. In the simplest version of multi-task learning, 
we have N = nT where T is the number of tasks (for instance, in image 
detection each task t is associated with a particular type of visual object, 
e.g., face, car, chair, etc.), and n is the number of observations per task. 
The tasks are characterized by vectors of parameters $a_t^* \in \mathbb{R}^m$,
$t = 1, \dots, T$, which constitute the columns of the matrix $A^*$:

$A^* = (a_1^* \cdots a_T^*).$

The $X_i$ are column masks, each containing only one nonzero column
$x^{(t,s)} \in \mathbb{R}^m$ (with the convention that $x^{(t,s)}$ is the $t$th column):

$X_i \in \{ (0 \cdots 0 \; \underbrace{x^{(t,s)}}_{t} \; 0 \cdots 0),\ t = 1, \dots, T,\ s = 1, \dots, n \}.$

The column $x^{(t,s)}$ is interpreted as the vector of predictor variables
corresponding to the $s$th observation for the $t$th task. Thus, for each
$i = 1, \dots, N$ there exists a pair $(t, s)$ with $t = 1, \dots, T$,
$s = 1, \dots, n$, such that

(1.2) $\operatorname{tr}(X_i' A^*) = (a_t^*)' x^{(t,s)}.$

If we denote by $Y^{(t,s)}$ and $\xi^{(t,s)}$ the corresponding values $Y_i$
and $\xi_i$, then the trace regression model (1.1) can be written as a
collection of $T$ standard regression models:

$Y^{(t,s)} = (a_t^*)' x^{(t,s)} + \xi^{(t,s)}, \quad t = 1, \dots, T,\ s = 1, \dots, n.$

This is the usual formulation of the multi-task learning model in the
literature.
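
The reduction (1.2) can be checked numerically. The following Python/NumPy sketch is an illustration under hypothetical sizes and random data (none of the names below come from the paper): it builds a column mask whose only nonzero column is the $t$th one and compares the trace with the inner product of the $t$th column of $A^*$ and the predictor vector.

```python
import numpy as np

def column_mask(x_ts, t, T):
    """Column mask: all columns are zero except column t, which equals x_ts."""
    X = np.zeros((x_ts.shape[0], T))
    X[:, t] = x_ts
    return X

rng = np.random.default_rng(1)  # hypothetical seed
m, T, n = 5, 3, 4               # hypothetical sizes: parameters, tasks, obs per task
A_star = rng.standard_normal((m, T))
x = rng.standard_normal((T, n, m))  # x[t, s] plays the role of x^{(t,s)}

# Largest deviation between tr(X_i' A*) and (a_t*)' x^{(t,s)} over all pairs (t, s).
max_gap = 0.0
for t in range(T):
    for s in range(n):
        X = column_mask(x[t, s], t, T)
        lhs = np.trace(X.T @ A_star)
        rhs = A_star[:, t] @ x[t, s]
        max_gap = max(max_gap, abs(lhs - rhs))
```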

For both examples given above, the matrices Xi are sparse in the sense 
that they have only a small portion of nonzero entries. On the other hand, 
such a sparsity property is not necessarily granted for the target matrix 
A*. Nevertheless, we can always characterize A* by its rank r = rank(A*), 
and say that a matrix is sparse if it has small rank; cf. Recht, Fazel and 
Parrilo (2010). For example, the problem of estimation of a square matrix
$A^* \in \mathbb{R}^{m \times m}$ is a parametric problem which is formally of
dimension $m^2$ but it has only $(2m - r)r$ free parameters. If $r$ is small
as compared to $m$, then the intrinsic dimension of the problem is of the
order $rm$. In other words, the rank sparsity assumption $r \ll m$ is a
dimension reduction assumption. This assumption will be crucial for the
interpretation of our results. Another sparsity assumption that we will
consider is that the Schatten-$p$ norm of $A^*$ (see the definition in
Section 2 below) is small for some $0 < p \le 1$. This is an analog of
sparsity expressed in terms of the $\ell_p$ norm, $0 < p \le 1$, in vector
estimation problems.

Estimation of high-dimensional matrices has been recently studied by
several authors in settings different from ours [cf., e.g., Meinshausen and
Bühlmann (2006), Bickel and Levina (2008), Ravikumar et al. (2008), Amini
and Wainwright (2009), Cai, Zhang and Zhou (2010) and the references cited
therein]. Most attention was devoted to estimation of a large covariance
matrix or its inverse. In these papers, sparsity is characterized by the number
of nonzero entries of a matrix.

Candès and Recht (2009), Candès and Tao (2009), Gross (2009), Recht
(2009) considered the nonnoisy setting ($\xi_i \equiv 0$) of the matrix
completion problem under conditions that the singular vectors of $A^*$ are
sufficiently spread out on the unit sphere or "incoherent." They focused
on exact recovery of $A^*$. Until now, the sharpest results are those of
Gross (2009) and Recht (2009) who showed that under the "incoherence
condition" the exact recovery is possible with high probability if
$N > Cr(m + T) \log^2 m$ with some constant $C > 0$ when we observe $N$
entries of a matrix $A^* \in \mathbb{R}^{m \times T}$ with locations uniformly
sampled at random. Candès and Plan (2010a), Keshavan, Montanari and Oh
(2009) explored the same setting in the presence of noise, proposed
estimators $\hat A$ of $A^*$ and evaluated their Frobenius norm error
$\|\hat A - A^*\|_F$. The better bounds are in Keshavan, Montanari and Oh
(2009) who suggest $\hat A$ such that for $A^* \in \mathbb{R}^{m \times T}$ and
$T = \alpha m$ with $\alpha > 1$ the squared error $\|\hat A - A^*\|_F^2$ is of
the order $\alpha^{5/2} r m^2 (\log N)/N$ with probability close to 1 when the
noise is i.i.d. Gaussian.

In this paper, we consider the general noisy setting of the trace regression
problem. We study a class of Schatten-$p$ estimators $\hat A$, that is, the
penalized least squares estimators with a penalty proportional to the
Schatten-$p$ norm; cf. (2.5). The special case $p = 1$ corresponds to the
"matrix Lasso." We study the convergence properties of their prediction error

$d_{2,N}(\hat A, A^*)^2 = N^{-1} \sum_{i=1}^{N} \operatorname{tr}^2(X_i'(\hat A - A^*))$

and of their Schatten-$q$ error. The main contributions of this paper are the
following.

(a) For all $0 < p \le 1$, under various assumptions on the masks $X_i$ (no
assumption, USR matrix completion, CS matrix completion) we obtain
different bounds on the prediction error of Schatten-$p$ estimators
involving the Schatten-$p$ norm of $A^*$.

(b) For $p$ sufficiently close to 0, under a mild assumption on $X_i$, we show
that Schatten-$p$ estimators achieve the prediction error rate of
convergence $r \max(m, T)/N$, up to a logarithmic factor. This result is
valid for matrices $A^*$ whose singular values are not exponentially large
in $N$. It covers the matrix completion and high-dimensional multi-task
learning problems.

(c) For all $0 < p \le 1$, we obtain upper bounds for the prediction error
under the matrix Restricted Isometry (RI) condition on the masks $X_i$,
which is a rather strong condition, and under the assumption that
$\operatorname{rank}(A^*) \le r$. We also derive the bounds for the
Schatten-$q$ error of $\hat A$. The rate in the bounds for the prediction
error is $r \max(m, T)/N$ when the RI condition is satisfied with scaling
factor 1 (i.e., for the case not related to matrix completion and
high-dimensional multi-task learning).

(d) We prove the lower bounds showing that the rate $r \max(m, T)/N$ is
minimax optimal for the prediction error and Schatten-2 (i.e., Frobenius)
norm estimation error under the RI condition on the class of matrices
$A^*$ of rank smaller than $r$. Our result is even more general because we
prove our lower bound on the intersection of the Schatten-0 ball with
the Schatten-$p$ ball for any $0 < p \le 1$, which allows us to show minimax
optimality of the upper bounds of (a) as well. Furthermore, we prove
minimax lower bounds for collaborative sampling and USR matrix
completion problems.

The main point of this paper is to show that the suitably tuned Schatten
estimators attain the optimal rate of prediction error up to logarithmic
factors. The striking fact is that we can achieve this not only under very
restrictive assumptions, such as the RI condition, but also under very mild
assumptions on the masks $X_i$.

Finally, it is useful to compare the results for matrix estimation when
the sparsity is expressed by the rank with those for the high-dimensional
vector estimation when the sparsity is expressed by the number of nonzero
components of the vector. For the vector estimation, we have the linear
model

$Y_i = X_i' \beta + \xi_i, \quad i = 1, \dots, N,$

where $X_i \in \mathbb{R}^p$, $\beta \in \mathbb{R}^p$ and, for example, $\xi_i$ are
i.i.d. $\mathcal{N}(0, 1)$ random variables. Consider the high-dimensional case
$p \gg N$. (This is analogous to the assumption $m^2 \gg N$ in the matrix
problem and means that the nominal dimension is much larger than the sample
size.) The sparsity assumption for the vector case has the form $s \ll N$,
where $s$ is the number of nonzero components, or the intrinsic dimension
of $\beta$. Let $\hat\beta$ be an estimator of $\beta$. Then the optimal rate of
convergence of the prediction risk
$N^{-1} \sum_{i=1}^{N} (X_i'(\hat\beta - \beta))^2$ on the class of vectors
$\beta$ with given $s$ is of the order $s/N$, up to logarithmic factors. This
rate is shown to be attained, up to logarithmic factors, for many
estimators, such as the BIC, the Lasso, the Dantzig selector, Sparse
Exponential Weighting, etc.; cf., for example, Bunea, Tsybakov and Wegkamp
(2007), Koltchinskii (2008), Bickel, Ritov and Tsybakov (2009), Dalalyan
and Tsybakov (2008). Note that this rate is of the form
(intrinsic dimension)/(sample size) $= s/N$, up to a logarithmic factor.
The general interpretation is therefore completely analogous to that of the
matrix case: Assume for simplicity that $A^*$ is a square $m \times m$ matrix
with $\operatorname{rank}(A^*) = r$. As mentioned above, the intrinsic dimension
(the number of parameters to be estimated to recover $A^*$) is then
$(2m - r)r$, which is of the order $rm$ if $r \ll m$. An interesting
difference is that the logarithmic risk inflation factor is inevitable in the
vector case [cf. Donoho et al. (1992), Foster and George (1994)], but not in
the matrix problem, as our results reveal.

This paper is organized as follows. In Section 2, we introduce notation,
some definitions, basic facts about the Schatten quasi-norms and define the
Schatten-$p$ estimators. Section 3 describes elementary steps in their
convergence analysis and presents two general approaches to upper bounds on
the estimation and prediction error (cf. Theorems 1 and 2) depending on the
effective noise level $\tau$. Our main results are stated in Sections 4,
5 (matrix completion) and 6 (multi-task learning). They are obtained from
Theorems 1 and 2 by specifying the effective noise level $\tau$ under
particular assumptions on the masks $X_i$. Concentration bounds for certain
random matrices leading to the expressions for the effective noise level are
collected in Section 8. Section 7 is devoted to minimax lower bounds.
Sections 9 and 10 contain the main proofs. Finally, in Section 11 we
establish bounds for the $k$th entropy numbers of the quasi-convex Schatten
class embeddings $S_p^M \hookrightarrow S_2^M$, $p < 1$, which are needed for
our proofs and are of independent interest.

2. Preliminaries. 

2.1. Notation, definitions and basic facts. We will write $|\cdot|_2$ for the
Euclidean norm in $\mathbb{R}^d$ for any integer $d$. For any matrix
$A \in \mathbb{R}^{m \times T}$, we denote by $A_{(j,\cdot)}$ for $1 \le j \le m$
its $j$th row and write $A_{(\cdot,k)}$ for its $k$th column, $1 \le k \le T$.
We denote by $\sigma_1(A) \ge \sigma_2(A) \ge \cdots \ge 0$ the singular values
of $A$. The (quasi-)norm of some (quasi-)Banach space $B$ is canonically
denoted by $\|\cdot\|_B$. In particular, for any matrix
$A \in \mathbb{R}^{m \times T}$ and $0 < p \le \infty$ we consider the Schatten
(quasi-)norms

$\|A\|_{S_p} = \Big( \sum_{j=1}^{\min(m,T)} \sigma_j(A)^p \Big)^{1/p}$ for $0 < p < \infty$, and $\|A\|_{S_\infty} = \sigma_1(A)$.

The Schatten spaces $S_p$ are defined as spaces of all matrices
$A \in \mathbb{R}^{m \times T}$ equipped with the quasi-norm $\|A\|_{S_p}$. In
particular, the Schatten-2 norm coincides with the Frobenius norm

$\|A\|_{S_2} = \sqrt{\operatorname{tr}(A'A)} = \Big( \sum_{i,j} a_{ij}^2 \Big)^{1/2},$

where $a_{ij}$ denote the elements of the matrix $A \in \mathbb{R}^{m \times T}$.
Recall that for $0 < p < 1$ the Schatten spaces $S_p$ are not normed but only
quasi-normed, and $\|\cdot\|_{S_p}^p$ satisfies the inequality

(2.1) $\|A + B\|_{S_p}^p \le \|A\|_{S_p}^p + \|B\|_{S_p}^p$

for any $0 < p \le 1$ and any two matrices $A, B \in \mathbb{R}^{m \times T}$; cf.
McCarthy (1967) and Rotfeld (1969). We will use the following well-known
trace duality property:

$|\operatorname{tr}(A'B)| \le \|A\|_{S_1} \|B\|_{S_\infty} \quad \forall A, B \in \mathbb{R}^{m \times T}.$
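
The definitions above translate directly into code. The following Python/NumPy sketch is an illustration on random matrices with a hypothetical helper `schatten` (not taken from the paper): it computes the Schatten (quasi-)norm from the singular values and evaluates the three facts just stated, namely the coincidence of the Schatten-2 and Frobenius norms, inequality (2.1) for $p = 1/2$, and trace duality.

```python
import numpy as np

def schatten(A, p):
    """Schatten-p (quasi-)norm of A; p = np.inf gives the largest singular value."""
    s = np.linalg.svd(A, compute_uv=False)
    if np.isinf(p):
        return s.max()
    return (s ** p).sum() ** (1.0 / p)

rng = np.random.default_rng(3)  # hypothetical seed and sizes
A = rng.standard_normal((6, 4))
B = rng.standard_normal((6, 4))
p = 0.5

# Schatten-2 norm versus Frobenius norm.
frobenius_gap = abs(schatten(A, 2) - np.linalg.norm(A, "fro"))

# Both sides of the p-triangle inequality (2.1).
lhs_21 = schatten(A + B, p) ** p
rhs_21 = schatten(A, p) ** p + schatten(B, p) ** p

# Both sides of the trace duality inequality.
duality_lhs = abs(np.trace(A.T @ B))
duality_rhs = schatten(A, 1) * schatten(B, np.inf)
```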

2.2. Characteristics of the sampling operator. Let
$\mathcal{L} : \mathbb{R}^{m \times T} \to \mathbb{R}^N$ be the sampling operator,
that is, the linear mapping defined by

$A \mapsto (\operatorname{tr}(X_1' A), \dots, \operatorname{tr}(X_N' A))/\sqrt{N}.$

We have

$|\mathcal{L}(A)|_2^2 = N^{-1} \sum_{i=1}^{N} \operatorname{tr}^2(X_i' A).$

Depending on the context, we also write $d_{2,N}(A, B)$ for
$|\mathcal{L}(A - B)|_2$, where $A$ and $B$ are any matrices in
$\mathbb{R}^{m \times T}$. Unless explicitly stated otherwise, we will tacitly
assume that the matrices $X_i$ are nonrandom.

We will denote by $\phi_{\max}(1)$ the maximal rank-1 restricted eigenvalue of
$\mathcal{L}$:

(2.2) $\phi_{\max}(1) = \sup_{A \in \mathbb{R}^{m \times T} :\ \operatorname{rank}(A) = 1} \frac{|\mathcal{L}(A)|_2}{\|A\|_{S_2}}.$

We now introduce two basic assumptions on the sampling operator that will
be used in the sequel. The sampling operator $\mathcal{L}$ will be called
uniformly bounded if there exists a constant $c_0 < \infty$ such that

(2.3) $\sup_{A \in \mathbb{R}^{m \times T} \setminus \{0\}} \frac{|\mathcal{L}(A)|_2^2}{\|A\|_{S_2}^2} \le c_0$ uniformly in $m$, $T$ and $N$.

Clearly, if $\mathcal{L}$ is uniformly bounded, then $\phi_{\max}^2(1) \le c_0$.
Condition (2.3) is trivially satisfied with $c_0 = 1$ for USR matrix
completion and with $c_0 = 1/N$ for CS matrix completion.
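
The two constants in the last sentence can be seen numerically. The Python/NumPy sketch below is illustrative only, with hypothetical sizes and names. For single-entry masks it evaluates the squared empirical norm of the sampling operator and its ratio to the squared Frobenius norm, once for positions drawn with replacement (USR) and once for distinct positions (CS).

```python
import numpy as np

def sampling_sq_norm(A, rows, cols):
    """Squared empirical norm N^{-1} sum_i tr^2(X_i' A) for single-entry masks."""
    return np.mean(A[rows, cols] ** 2)

rng = np.random.default_rng(4)  # hypothetical seed and sizes
m, T, N = 8, 6, 20
A = rng.standard_normal((m, T))
fro_sq = np.sum(A ** 2)

# USR: positions drawn uniformly with replacement; the ratio is at most 1.
rows = rng.integers(0, m, size=N)
cols = rng.integers(0, T, size=N)
ratio_usr = sampling_sq_norm(A, rows, cols) / fro_sq

# CS: all positions distinct; the ratio is at most 1/N.
flat = rng.choice(m * T, size=N, replace=False)
rows_cs, cols_cs = np.unravel_index(flat, (m, T))
ratio_cs = sampling_sq_norm(A, rows_cs, cols_cs) / fro_sq
```

In the CS case the sum of squares over distinct entries cannot exceed the full sum of squares, which is where the extra factor $1/N$ comes from.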
The sampling operator $\mathcal{L}$ is said to satisfy the Restricted Isometry
condition RI$(r, \nu)$ for some integer $1 \le r \le \min(m, T)$ and some
$0 < \nu < \infty$ if there exists a constant $\delta_r \in (0, 1)$ such that

(2.4) $(1 - \delta_r) \|A\|_{S_2} \le \nu |\mathcal{L}(A)|_2 \le (1 + \delta_r) \|A\|_{S_2}$

for all matrices $A \in \mathbb{R}^{m \times T}$ of rank at most $r$.

This condition differs from the Restricted Isometry condition introduced
by Candès and Tao (2005) in the vector case, and from its analog for the
matrix case suggested by Recht, Fazel and Parrilo (2010), in that we state
it with a scaling factor $\nu$. This factor is introduced to account for
the fact that the masks $X_i$ are typically very sparse, so that they do not
induce isometries with coefficient close to one. Indeed, $\nu$ will be large
in the examples that we consider below.

2.3. Least squares estimators with Schatten penalty. In this paper, we
study the estimators $\hat A$ defined as a solution of the minimization problem

(2.5) $\min_{A \in \mathbb{R}^{m \times T}} \Big( \frac{1}{N} \sum_{i=1}^{N} (Y_i - \operatorname{tr}(X_i' A))^2 + \lambda \|A\|_{S_p}^p \Big)$

with some fixed $0 < p \le 1$ and $\lambda > 0$. The case $p = 1$ (matrix Lasso)
is of outstanding interest since the minimization problem is then convex and
thus can be efficiently solved in polynomial time. We call $\hat A$ the
Schatten-$p$ estimator. Such estimators have been recently considered by many
authors motivated by applications to multi-task learning and recommendation
systems. Probably, the first study is due to Srebro, Rennie and Jaakkola
(2005) who dealt with binary classification and considered the Schatten-1
estimator with the hinge loss rather than squared loss. Argyriou et al.
(2008), Argyriou, Evgeniou and Pontil (2008), Argyriou, Micchelli and Pontil
(2010), Bach (2008), Abernethy et al. (2009) discussed connections of (2.5)
to other related minimization problems, along with characterizations of the
solutions and computational issues, mainly focusing on the convex case
$p = 1$. Also for the nonconvex case ($0 < p < 1$), Argyriou et al. (2008),
Argyriou, Evgeniou and Pontil (2008) suggested an algorithm of approximate
computation of the Schatten-$p$ estimator or its analogs. However, for
$0 < p < 1$ the methods can find only a local minimum in (2.5), so that
Schatten estimators with such $p$ remain for the moment mainly of theoretical
value. In particular, analyzing these estimators reveals which rates of
convergence can, in principle, be attained.
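
For the convex case $p = 1$, one standard way to approximate a minimizer of (2.5) is proximal gradient descent with singular value soft-thresholding. This is a generic technique that the paper does not prescribe; the sketch below is a minimal illustration for single-entry masks, with hypothetical function names, sizes and a hypothetical penalty level `lam` (in particular, it does not use the choice $\lambda = 4\tau$ studied in Section 3).

```python
import numpy as np

def singular_value_threshold(A, thresh):
    """Proximal map of thresh * (Schatten-1 norm): soft-threshold the singular values."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    return (U * np.maximum(s - thresh, 0.0)) @ Vt

def objective(A, rows, cols, Y, lam):
    """Criterion (2.5) with p = 1 for single-entry masks at positions (rows, cols)."""
    resid = Y - A[rows, cols]
    return np.mean(resid ** 2) + lam * np.linalg.svd(A, compute_uv=False).sum()

def matrix_lasso(rows, cols, Y, shape, lam, n_iter=300):
    """Proximal gradient iterations started from the zero matrix."""
    N = len(Y)
    counts = np.zeros(shape)
    np.add.at(counts, (rows, cols), 1.0)  # how often each entry is observed
    step = N / (2.0 * counts.max())       # inverse Lipschitz constant of the gradient
    A = np.zeros(shape)
    for _ in range(n_iter):
        grad = np.zeros(shape)
        np.add.at(grad, (rows, cols), 2.0 * (A[rows, cols] - Y) / N)
        A = singular_value_threshold(A - step * grad, step * lam)
    return A

rng = np.random.default_rng(5)           # hypothetical seed and sizes
m, T, r, N, sigma = 20, 15, 2, 150, 0.1
A_star = rng.standard_normal((m, r)) @ rng.standard_normal((r, T))
rows = rng.integers(0, m, size=N)
cols = rng.integers(0, T, size=N)
Y = A_star[rows, cols] + sigma * rng.standard_normal(N)

lam = 0.02                               # hypothetical penalty level
A_hat = matrix_lasso(rows, cols, Y, (m, T), lam)
obj_start = objective(np.zeros((m, T)), rows, cols, Y, lam)
obj_end = objective(A_hat, rows, cols, Y, lam)
```

With the step size equal to the inverse Lipschitz constant of the gradient of the quadratic term, the iterations do not increase the criterion, so `obj_end` is at most `obj_start`. For $0 < p < 1$ no such simple scheme with a global guarantee is available.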

The statistical properties of Schatten estimators are not yet well
understood. To our knowledge, the only previous study is that of Bach (2008)
showing that for $p = 1$, under some condition on the $X_i$'s [analogous to
the strong irrepresentability condition in the vector case; cf. Meinshausen
and Bühlmann (2006), Zhao and Yu (2006)], $\operatorname{rank}(A^*)$ is
consistently recovered by $\operatorname{rank}(\hat A)$ when $m$, $T$ are fixed
and $N \to \infty$. Our results are of a different kind. They are nonasymptotic
and meaningful in the case $mT \gg N > \max(m, T)$. Furthermore, we do not
consider the recovery of the rank, but rather the estimation and prediction
properties of Schatten-$p$ estimators.

After this paper had been submitted, we became aware of interesting
contemporaneous and independent works by Candès and Plan (2010b),
Negahban et al. (2009) and Negahban and Wainwright (2011). Those papers
focus on the bounds for the Schatten-2 (i.e., Frobenius) norm error of the
matrix Lasso estimator under the matrix RI condition. This is related to
the particular instance of our results in item (c) above with $p = 1$ and
$q = 2$. Their analysis of this case is complementary to ours in several
aspects. Negahban and Wainwright (2011) derive their bound under the
assumption that $X_i$ are matrices with i.i.d. standard Gaussian elements and
$A^*$ belongs to a Schatten-$p'$ ball with $0 \le p' \le 1$, which leads to
rates different from ours if $p' \ne 0$. An assumption used in this context in
Negahban and Wainwright (2011) is that $N > mT$ (in our notation), which
excludes the high-dimensional case $mT \gg N$ that we are mainly interested
in. Candès and Plan (2010b) consider approximately low-rank matrices, explore
the closely related matrix Dantzig selector and provide lower bounds
corresponding to a special case of item (d) above. The results of these
papers do not cover the matrix completion and multi-task learning problems,
which are the main focus of our study. We also mention a more recent work by
Bunea, She and Wegkamp (2010) dealing with a special case of our model and
analyzing matrix estimators penalized by the rank.

3. Two schemes of analyzing Schatten estimators. In this section, we
discuss two schemes of proving upper bounds on the prediction error of
$\hat A$. The first bound involves only the Schatten-$p$ norm of the matrix
$A^*$. The second involves only the rank of $A^*$ but needs the RI condition
on the sampling operator.

We start by sketching elementary steps in the convergence analysis of
Schatten-$p$ estimators. By the definition of $\hat A$,

$\frac{1}{N} \sum_{i=1}^{N} (Y_i - \operatorname{tr}(X_i' \hat A))^2 + \lambda \|\hat A\|_{S_p}^p \le \frac{1}{N} \sum_{i=1}^{N} (Y_i - \operatorname{tr}(X_i' A^*))^2 + \lambda \|A^*\|_{S_p}^p.$

Recalling that $Y_i = \operatorname{tr}(X_i' A^*) + \xi_i$, we can transform this
by simple algebra to

(3.1) $d_{2,N}(\hat A, A^*)^2 \le \frac{2}{N} \sum_{i=1}^{N} \xi_i \operatorname{tr}((\hat A - A^*)' X_i) + \lambda (\|A^*\|_{S_p}^p - \|\hat A\|_{S_p}^p).$

Table 1

Effective noise level for uniformly bounded $\mathcal{L}$, USR and collaborative
sampling matrix completion. Here $M = \max(m, T)$, and the constants $c > 0$,
$c(p) > 0$ depend only on $\sigma$

Assumptions on $X_i$             Assumptions on $N, m, T, p$    Value of $\tau$
Uniformly bounded $\mathcal{L}$  $0 < p \le 1$                  $c(p)(M/N)^{1-p/2}$
USR matrix completion            $p = 1$, $(m + T)mT > N$       $c \min(M/N, (\log M)/\sqrt{N})$
CS matrix completion             $p = 1$                        $c M^{1/2}/N$


In the sequel, inequality (3.1) will be referred to as the basic inequality
and the random variable
$N^{-1} \sum_{i=1}^{N} \xi_i \operatorname{tr}((\hat A - A^*)' X_i)$ will be
called the stochastic term. The core of the analysis of Schatten-$p$
estimators consists in proving tight bounds for the right-hand side of the
basic inequality (3.1). For this purpose, we first need a control of the
stochastic term. Section 8 below demonstrates that such a control strongly
depends on the properties of $\mathcal{L}$, that is, of the problem at hand.
In summary, Section 8 establishes that, under suitable conditions, for any
$0 < p \le 1$ the stochastic term can be bounded for all $\delta > 0$ with
probability close to 1 as follows:

(3.2) $\Big| \frac{1}{N} \sum_{i=1}^{N} \xi_i \operatorname{tr}((\hat A - A^*)' X_i) \Big| \le \begin{cases} \tau \|\hat A - A^*\|_{S_1} & \text{for } p = 1, \\ \frac{\delta}{2}\, d_{2,N}(\hat A, A^*)^2 + \tau \delta^{p-1} \|\hat A - A^*\|_{S_p}^p & \text{for } 0 < p < 1, \end{cases}$

where $0 < \tau < \infty$ depends on $m$, $T$ and $N$. The quantity $\tau$ plays
a crucial role in this bound. We will call $\tau$ the effective noise level.
Exact expressions for $\tau$ under various assumptions on the sampling
operator $\mathcal{L}$ and on the noise $\xi_i$ are derived in Section 8. In
Table 1, we present the values of $\tau$ for three important examples under
the assumption that $\xi_i$ are i.i.d. Gaussian $\mathcal{N}(0, \sigma^2)$
random variables. In the cases listed in Table 1, inequality (3.2) holds with
probability $1 - \varepsilon$, where
$\varepsilon = (1/C) \exp(-C(m + T))$ (first and third example) and
$\varepsilon = (1/C')(\max(m, T) + 1)^{-C'}$ (second example) with constants
$C, C' > 0$ independent of $N$, $m$, $T$.

The following two points will be important to understand the subsequent
results:

• In this paper, we will always choose the regularization parameter
$\lambda$ in the form $\lambda = 4\tau$.

• With this choice of $\lambda$, the effective noise level $\tau$
characterizes the rate of convergence of the Schatten estimator. The smaller
$\tau$ is, the faster the rate.

In particular, the first line in Table 1 reveals that when
$M = \max(m, T) < N$ the largest $\tau$ corresponds to $p = 1$ and it becomes
smaller when $p$ decreases to 0. This suggests that choosing Schatten-$p$
estimators with $p < 1$ and especially $p$ close to 0 might be advantageous.
Note that the assumption of uniform boundedness of $\mathcal{L}$ is very mild.
For example, it is trivially satisfied with $c_0 = 1$ for USR matrix
completion and with $c_0 = 1/N$ for CS matrix completion. However, in these
two cases a specific analysis leads to sharper bounds on the effective noise
level (i.e., on the rate of convergence of the estimators); cf. the second and
third lines of Table 1.

In this section, we provide two bounds on the prediction error of $\hat A$
with a general effective noise level $\tau$. We then detail them in
Sections 4-6 for particular values of $\tau$ depending on the assumptions on
the $X_i$. The first bound involves the Schatten-$p$ norm of the matrix $A^*$.

Theorem 1. Let $A^* \in \mathbb{R}^{m \times T}$, and let $0 < p \le 1$. Assume
that (3.2) holds with probability at least $1 - \varepsilon$ for some
$\varepsilon > 0$ and $0 < \tau < \infty$. Let $\hat A$ be the Schatten-$p$
estimator defined as a minimizer of (2.5) with $\lambda = 4\tau$. Then

(3.3) $d_{2,N}(\hat A, A^*)^2 \le 16 \tau \|A^*\|_{S_p}^p$

holds with probability at least $1 - \varepsilon$.

Proof. From (3.1) and (3.2) with $\delta = 1/2$ and $\lambda = 4\tau$, we get

$d_{2,N}(\hat A, A^*)^2 \le 8\tau (\|\hat A - A^*\|_{S_p}^p + \|A^*\|_{S_p}^p - \|\hat A\|_{S_p}^p).$

This and the $p$-norm inequality (2.1) yield (3.3). □

The bound (3.3) depends on the magnitude of the elements of $A^*$ via
$\|A^*\|_{S_p}$. The next theorem shows that under the RI condition this
dependence can be avoided, and only the rank of $A^*$ affects the rate of
convergence.

Theorem 2. Let $A^* \in \mathbb{R}^{m \times T}$ with
$\operatorname{rank}(A^*) \le r$, and let $0 < p \le 1$. Assume that (3.2) holds
with probability at least $1 - \varepsilon$ for some $\varepsilon > 0$ and
$0 < \tau < \infty$. Assume also that the Restricted Isometry condition
RI$((2 + a)r, \nu)$ holds with some $0 < \nu < \infty$, with a sufficiently
large $a = a(p)$ depending only on $p$ and with
$0 < \delta_{(2+a)r} \le \delta_0$ for a sufficiently small
$\delta_0 = \delta_0(p)$ depending only on $p$.

Let $\hat A$ be the Schatten-$p$ estimator defined as a minimizer of (2.5)
with $\lambda = 4\tau$. Then with probability at least $1 - \varepsilon$ we
have

(3.4) $d_{2,N}(\hat A, A^*)^2 \le C_1 r \tau^{2/(2-p)} \nu^{2p/(2-p)},$

(3.5) $\|\hat A - A^*\|_{S_q}^q \le C_2 r \tau^{q/(2-p)} \nu^{2q/(2-p)} \quad \forall q \in [p, 2],$

where $C_1$ and $C_2$ are constants, $C_1$ depends only on $p$ and $C_2$
depends on $p$ and $q$.

The proof of Theorem 2 is given in Section 9. The values $a = a(p)$ and
$\delta_0(p)$ can be deduced from the proof. In particular, for $p = 1$ it is
sufficient to take $a = 19$.

Remark 1. Note that if $\nu = 1$ the rates in (3.4) and (3.5) do not depend
on $p$ if we assume in addition the uniform boundedness of $\mathcal{L}$,
which is a very mild condition. Indeed, taking the value of $\tau$ from the
first line of Table 1 we see that
$r \tau^{2/(2-p)} \nu^{2p/(2-p)} \sim rM/N$ for all $0 < p \le 1$. Thus, under
the RI condition, using Schatten-$p$ estimators with $p < 1$ does not improve
the rate of convergence on the class of matrices $A^*$ of rank at most $r$.
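
The exponent arithmetic behind Remark 1 can be checked with a few lines of Python. The sketch below is illustrative: the function name and the numerical values are hypothetical, and the constant $c(p)$ from Table 1 is set to 1. It plugs $\tau = (M/N)^{1-p/2}$ into the rate $r \tau^{2/(2-p)} \nu^{2p/(2-p)}$ of (3.4) with $\nu = 1$ and compares the result with $rM/N$ for several values of $p$.

```python
import math

def prediction_rate(r, M, N, p, nu=1.0):
    """Rate r * tau^(2/(2-p)) * nu^(2p/(2-p)) with tau = (M/N)^(1-p/2), c(p) = 1."""
    tau = (M / N) ** (1.0 - p / 2.0)
    return r * tau ** (2.0 / (2.0 - p)) * nu ** (2.0 * p / (2.0 - p))

r, M, N = 3, 50, 10_000  # hypothetical values
rates = [prediction_rate(r, M, N, p) for p in (0.1, 0.5, 1.0)]
target = r * M / N       # the rate r M / N of Remark 1
```

The exponents cancel because $(1 - p/2) \cdot 2/(2 - p) = 1$, so every entry of `rates` equals `target` up to rounding.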

Discussion about the scaling factor $\nu$. Remark 1 deals with the case
$\nu = 1$, which does not always seem appropriate for trace regression
models. To our knowledge, the only available examples of matrices $X_i$ such
that the sampling operator $\mathcal{L}$ satisfies the RI condition with
$\nu = 1$ are complete matrices, that is, matrices with all nonzero entries,
which are random and have specific distributions [typically, i.i.d.
Rademacher or Gaussian entries; cf. Recht, Fazel and Parrilo (2010)]. Except
for degenerate cases [such as $N = mT$, the $X_i$ distinct and of the form
$\sqrt{N} e_k(m) e_l(T)'$ for $1 \le k \le m$, $1 \le l \le T$] the sampling
operator $\mathcal{L}$ typically defines a restricted isometry with $\nu = 1$
only if the matrices $X_i$ contain a considerable number of (uniformly
bounded) nonzero entries.

Let us now specify the form of the RI condition in the context of
multi-task learning discussed in the Introduction. Using (1.2) for a matrix
$A = (a_1 \cdots a_T)$, we obtain

$|\mathcal{L}(A)|_2^2 = N^{-1} \sum_{i=1}^{N} \operatorname{tr}^2(X_i' A) = (nT)^{-1} \sum_{t=1}^{T} \sum_{s=1}^{n} (a_t' x^{(t,s)})^2 = T^{-1} \sum_{t=1}^{T} a_t' \Psi_t a_t,$

where $\Psi_t = n^{-1} \sum_{s=1}^{n} x^{(t,s)} (x^{(t,s)})'$ is the Gram matrix
of predictors for the $t$th task. These matrices correspond to $T$ separate
regression models. The standard assumption is that they are normalized so
that all the diagonal elements of each $\Psi_t$ are equal to 1. This suggests
that the natural RI scaling factor $\nu$ for such a model is of the order
$\nu \sim \sqrt{T}$. For example, in the simplest case when all the matrices
$\Psi_t$ are just equal to the $m \times m$ identity matrix, we find
$|\mathcal{L}(A)|_2^2 = T^{-1} \sum_{t=1}^{T} a_t' \Psi_t a_t = T^{-1} \|A\|_{S_2}^2$.
Similarly, we get the RI condition with scaling factor $\nu \sim \sqrt{T}$
when the spectra of all the Gram matrices $\Psi_t$, $t = 1, \dots, T$, are
included in a fixed interval $[a, b]$ with $0 < a \le b < \infty$. However,
this excludes the high-dimensional task regressions, such that the number of
parameters $m$ is larger than the sample size, $m > n$. In conclusion,
application of the matrix RI techniques in multi-task learning is restricted
to low-dimensional regression and the scaling factor is $\nu \sim \sqrt{T}$.
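
The identity relating the empirical norm of the sampling operator to the per-task Gram matrices can be verified numerically. The following Python/NumPy sketch is an illustration with hypothetical sizes and random predictors; the array `x[t, s]` stands for the predictor vector of the $s$th observation of the $t$th task, and `Psi[t]` for the corresponding Gram matrix.

```python
import numpy as np

rng = np.random.default_rng(2)  # hypothetical seed and sizes
m, T, n = 4, 3, 6
N = n * T
A = rng.standard_normal((m, T))
x = rng.standard_normal((T, n, m))

# Left-hand side: N^{-1} * sum over (t, s) of (a_t' x^{(t,s)})^2.
lhs = sum((A[:, t] @ x[t, s]) ** 2 for t in range(T) for s in range(n)) / N

# Gram matrices Psi_t = n^{-1} sum_s x^{(t,s)} (x^{(t,s)})', stacked along axis 0.
Psi = np.einsum("tsi,tsj->tij", x, x) / n

# Right-hand side: T^{-1} * sum_t a_t' Psi_t a_t.
rhs = sum(A[:, t] @ Psi[t] @ A[:, t] for t in range(T)) / T

# Identity Gram matrices give T^{-1} times the squared Frobenius norm of A.
rhs_identity = sum(A[:, t] @ np.eye(m) @ A[:, t] for t in range(T)) / T
```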

The reason for the failure of the RI approach is that the masks $X_i$ are
sparse. The sparser the $X_i$ are, the larger $\nu$ is. The extreme situation
corresponds to matrix completion problems. Indeed, if $N < mT$, then there
exists a matrix of rank 1 in the null-space of the sampling operator
$\mathcal{L}$ and hence the RI condition cannot be satisfied. For $N \ge mT$ we
can have the RI condition with scaling factor $\nu \sim \sqrt{mT}$, but
$N \ge mT$ means that essentially all the entries are observed, so that the
very problem of completion does not arise.
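
The null-space argument for matrix completion is easy to make concrete. In the Python/NumPy sketch below (illustrative, with hypothetical sizes and seed), fewer than $mT$ entries are sampled, so at least one position is never observed. The matrix with a single 1 at such a position has rank 1 and unit Frobenius norm, yet it is mapped to zero by the sampling operator, so no lower RI bound can hold.

```python
import numpy as np

rng = np.random.default_rng(6)  # hypothetical seed and sizes
m, T, N = 5, 4, 12              # N < m * T = 20
rows = rng.integers(0, m, size=N)
cols = rng.integers(0, T, size=N)

# Mark the observed positions and pick one that is never sampled.
observed = np.zeros((m, T), dtype=bool)
observed[rows, cols] = True
k, l = np.argwhere(~observed)[0]

# Rank-1 matrix supported on the unobserved position.
A = np.zeros((m, T))
A[k, l] = 1.0

sampling_sq = np.mean(A[rows, cols] ** 2)  # squared empirical norm of L(A)
frobenius = np.linalg.norm(A)              # Schatten-2 norm of A
```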

4. Upper bounds under mild conditions on the sampling operator. The
above discussion suggests that Theorem 2 and, in general, the argument
based on the restricted isometry or related conditions are not well adapted
for several interesting settings. Motivated by this, we propose another
approach described in the next theorem, which requires only the comparatively
mild uniform boundedness condition (2.3). For simplicity, we focus on
Gaussian errors $\xi_i$. Set $M = \max(m, T)$.

Theorem 3. Let $\xi_1, \dots, \xi_N$ be i.i.d. $\mathcal{N}(0, \sigma^2)$
random variables. Assume that $M > 1$, $N > eM$ and that the uniform
boundedness condition (2.3) is satisfied. Let
$A^* \in \mathbb{R}^{m \times T}$ with $\operatorname{rank}(A^*) \le r$ and the
maximal singular value $\sigma_1(A^*) \le (N/M)^{C^*}$ for some
$0 < C^* < \infty$. Set $p = (\log(N/M))^{-1}$,
$c_\kappa = (2\kappa - 1)(2\kappa) \kappa^{-1/(2\kappa - 1)}$ where
$\kappa = (2 - p)/(2 - 2p)$ and

(4.1) $\lambda = 4 c_\kappa (\vartheta/p)^{1-p/2} (M/N)^{1-p/2}$

for some $\vartheta > C^2$ and $C$ a universal positive constant independent
of $r$, $M$ and $N$. Then the Schatten-$p$ estimator $\hat A$ defined as a
minimizer of (2.5) with $\lambda$ as in (4.1) satisfies

(4.2) $d_{2,N}(\hat A, A^*)^2 \le C_3 \frac{rM}{N} \log\Big(\frac{N}{M}\Big)$

with probability at least $1 - C \exp(-\vartheta M/C^2)$ where the positive
constant $C_3$ is independent of $r$, $M$ and $N$.

Proof. Inequality (3.2) holds with probability at least
$1 - C \exp(-\vartheta M/C^2)$ by Lemma 5. We then use (3.3) and note that,
under our choice of $p$, $\tau \le c \vartheta M/(Np)$ for some constant
$c < \infty$, which does not depend on $M$ and $N$, and

$\|A^*\|_{S_p}^p \le r [\sigma_1(A^*)]^p \le r \Big(\frac{N}{M}\Big)^{C^* p} = \exp(C^*)\, r.$ □
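
The role of the choice $p = (\log(N/M))^{-1}$ in the proof can be illustrated numerically. In the Python sketch below the values of $N$, $M$ and $C^*$ are hypothetical; it confirms that this choice turns the polynomial bound $(N/M)^{C^* p}$ on $[\sigma_1(A^*)]^p$ into the constant $\exp(C^*)$, and that $(N/M)^{p/2}$ equals $e^{1/2}$, which is why $\tau$ is of order $M/(Np)$.

```python
import math

def schatten_p_choice(N, M):
    """The exponent p = 1 / log(N / M) used in Theorem 3 (requires N > e * M)."""
    return 1.0 / math.log(N / M)

N, M, C_star = 10_000, 50, 3.0  # hypothetical values with N > e * M
p = schatten_p_choice(N, M)

singular_value_factor = (N / M) ** (C_star * p)  # bound on sigma_1(A*)^p
half_power = (N / M) ** (p / 2.0)                # the factor (N/M)^(p/2)
```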

Finally, we give the following theorem quantifying the rates of convergence
of the prediction risk in terms of the Schatten norms of $A^*$. Its proof is
straightforward in view of Theorem 1 and Lemmas 2 and 5.

Theorem 4. Let $\xi_1, \dots, \xi_N$ be i.i.d. $\mathcal{N}(0, \sigma^2)$
random variables and $A^* \in \mathbb{R}^{m \times T}$. Then the Schatten-$p$
estimator $\hat A$ has the following properties:

(i) Let $p = 1$, and $\lambda = 32 \sigma \phi_{\max}(1) \sqrt{(m + T)/N}$. Then

(4.3) $d_{2,N}(\hat A, A^*)^2 \le C \sigma \phi_{\max}(1) \|A^*\|_{S_1} \sqrt{\frac{m + T}{N}}$

with probability at least $1 - 2 \exp\{-(2 - \log 5)(m + T)\}$ where $C > 0$
is an absolute constant.

(ii) Let $0 < p \le 1$ and let the uniform boundedness condition (2.3) hold.
Set $\lambda$ as in (4.1). Then

(4.4) $d_{2,N}(\hat A, A^*)^2 \le C \|A^*\|_{S_p}^p \Big(\frac{M}{N}\Big)^{1-p/2}$

with probability at least $1 - C \exp(-\vartheta M/C^2)$ where
$M = \max(m, T)$ and the constant $C > 0$ is independent of $r$, $M$ and $N$.

In Theorem 5 below we show that these rates are optimal in a minimax
sense on the corresponding Schatten-$p$ balls for the sampling operators
satisfying the RI condition.

5. Upper bounds for noisy matrix completion. As discussed in Section 3, 
for matrix completion problems the restricted isometry argument as in The- 
orem 2 is not applicable. We will therefore use Theorems 1 and 3. First, 
combining Theorem 1 with Lemma 3 of Section 8 we get the following corol- 
lary. 

Corollary 1 (USR matrix completion). (i) Let the i.i.d. zero-mean
random variables $\xi_i$ satisfy the Bernstein condition (8.2). Assume that
$mT(m+T)>N$ and consider the USR matrix completion model. Let $\tau_2$ be
given by (8.10) with some $D\ge2$. Then the Schatten-1 estimator $\hat A$ defined
with $\lambda=4\tau_2$ satisfies

(5.1)  $d_{2,N}(\hat A,A^*)^2\le 16\,C\,\|A^*\|_{S_1}\,\frac{m+T}{N}$

with probability at least $1-4\exp\{-(2-\log5)(m+T)\}$, where $C=4\sigma\sqrt{10D}+8HD$.

(ii) Let the i.i.d. zero-mean random variables $\xi_i$ satisfy the light tail
condition (8.3), and let $\tau_3$ be given by (8.11) for some $B>0$. Then the Schatten-1
estimator $\hat A$ defined with $\lambda=4\tau_3$ satisfies

(5.2) d 2 , N (lAr < l 6 p lgl ^ lQg(max(r ^ 1 ' T + 1)) 

v N 

with probability at least $1-(1/C)\max(m+1,T+1)^{-CB}$ for some constant
$C>0$ which does not depend on $m$, $T$ and $N$.

Next, combining Theorem 3 with Lemma 5 of Section 8 we get the fol- 
lowing corollary. 

Corollary 2 (USR matrix completion, nonconvex penalty). Let $\xi_1,\dots,\xi_N$
be i.i.d. $\mathcal N(0,\sigma^2)$ random variables. Assume that $M=\max(m,T)>1$,
$N>eM$ and consider the USR matrix completion model. Let $A^*\in\mathbb R^{m\times T}$
with $\operatorname{rank}(A^*)\le r$ and the maximal singular value $\sigma_1(A^*)\le(N/M)^{C^*}$ for
some $0<C^*<\infty$. Set $p=(\log(N/M))^{-1}$, $c_\kappa=(2\kappa-1)(2\kappa)\kappa^{-1/(2\kappa-1)}$,
$\kappa=(2-p)/(2-2p)$ and

$\lambda=4c_\kappa\,(\vartheta/p)^{1-p/2}\,(M/N)^{1-p/2}$

for some $\vartheta\ge C^2$ with a universal constant $C>0$, independent of $r$, $M$
and $N$. Then the Schatten-$p$ estimator $\hat A$ defined as a minimizer of (2.5)
satisfies

(5.3)  $d_{2,N}(\hat A,A^*)^2\le C_3\,\vartheta\,\frac{rM}{N}\,\log\Bigl(\frac{N}{M}\Bigr)$

with probability at least $1-C\exp(-\vartheta M/C^2)$, where the positive constant $C_3$
is also independent of $r$, $M$ and $N$.

Note that the bounds of Corollaries 1(i) and 2 achieve the rate $r\max(m,T)/N$,
up to logarithmic factors, under different conditions on the maximal
singular value of $A^*$. If $\max(m,T)<N<mT$ then the condition in
Corollary 2 does not imply more than a polynomial in $\max(m,T)$ growth
on $\sigma_1(A^*)$, which is a mild assumption. On the other hand, (5.1) requires
uniform boundedness of $\sigma_1(A^*)$ by some constant to achieve the same rate.
However, the estimators of Corollary 2 correspond to a nonconvex penalty and
are computationally hard.

We now turn to the collaborative sampling matrix completion. The next
corollary follows from combination of Theorem 1 with Lemmas 3 and 4 of
Section 8.

Corollary 3 (Collaborative sampling). Consider the problem of matrix
completion with collaborative sampling.

(i) Let $\xi_1,\dots,\xi_N$ be i.i.d. $\mathcal N(0,\sigma^2)$ random variables. Let $\tau_4$ be given
by (8.12) with some $D\ge2$. Then the Schatten-1 estimator $\hat A$ defined with
$\lambda=4\tau_4$ satisfies

(5.4)  $d_{2,N}(\hat A,A^*)^2\le 16\,C\,\|A^*\|_{S_1}\,\frac{\sqrt{m+T}}{N}$

with probability at least $1-2\exp\{-(D-\log5)(m+T)\}$, where $C=8\sigma\sqrt D$.

(ii) Let $\xi_1,\dots,\xi_N$ be i.i.d. zero-mean random variables satisfying the
Bernstein condition (8.2). Let $\tau_5$ be given by (8.13) with some $D\ge2$. Then the
Schatten-1 estimator $\hat A$ defined with $\lambda=4\tau_5$ satisfies

(5.5)  $d_{2,N}(\hat A,A^*)^2\le 64\,\|A^*\|_{S_1}\,\frac{\sigma\sqrt{2D(m+T)}+2HD(m+T)}{N}$

with probability at least $1-2\exp\{-(D-\log5)(m+T)\}$.

Remark 2. Using the inequality $\|A\|_{S_1}\le\sqrt r\,\|A\|_{S_2}$ for matrices $A$ of
rank at most $r$, we find that the bound (5.4) is minimax optimal on
the class of matrices

$\{A\in\mathbb R^{m\times T}:\ \operatorname{rank}(A)\le r,\ \|A\|_{S_2}^2\le C\sigma^2 r\max(m,T)\}$

for some constant $C>0$, if the masks $X_1,\dots,X_N$ fulfill the dispersion
condition of Theorem 7 below. It is further interesting to note that the
construction in the proof of the lower bound in Theorem 7 fails if the restriction is
$\|A\|_{S_2}^2\le\delta^2$ where $\delta^2$ is of smaller order than $r\max(m,T)$.
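The inequality $\|A\|_{S_1}\le\sqrt r\,\|A\|_{S_2}$ used in Remark 2 is the Cauchy-Schwarz inequality applied to the at most $r$ nonzero singular values. A minimal sketch that computes Schatten (quasi-)norms from the singular value decomposition and checks the inequality on a random rank-$r$ matrix; the helper names and the truncation tolerance `tol` are illustrative assumptions.

```python
import numpy as np

def schatten_norm(A, p, tol=1e-10):
    """Schatten-p (quasi-)norm: (sum_j sigma_j(A)^p)^(1/p), for p > 0.
    Singular values below a relative tolerance are treated as exact zeros,
    which matters for p < 1 where rounding noise would be inflated."""
    s = np.linalg.svd(A, compute_uv=False)
    s = s[s > tol * max(s.max(), 1.0)]
    return float((s ** p).sum() ** (1.0 / p))

def random_low_rank(m, T, r, seed=0):
    """Random m x T matrix of rank at most r (product of Gaussian factors)."""
    rng = np.random.default_rng(seed)
    return rng.standard_normal((m, r)) @ rng.standard_normal((r, T))

if __name__ == "__main__":
    r = 2
    A = random_low_rank(8, 6, r)
    s1, s2 = schatten_norm(A, 1), schatten_norm(A, 2)
    print(s1, np.sqrt(r) * s2)   # first value does not exceed the second
```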

6. Upper bounds for multi-task learning. For multi-task learning, we 
can apply both Theorems 2 and 3. Theorem 2 imposes a strong assumption 
on the masks $X_i$, namely the RI condition. Nevertheless, the advantage is
that Theorem 2 covers the computationally easy case $p=1$.

Corollary 4 (Multi-task learning; RI condition). Let $\xi_1,\dots,\xi_N$ be
i.i.d. $\mathcal N(0,\sigma^2)$ random variables. Consider the multi-task learning problem
with $\operatorname{rank}(A^*)\le r$. Assume that the spectra of the Gram matrices $\Psi_t$ are
uniformly in $t$ bounded from above by a constant $c_1<\infty$. Assume also that
the Restricted Isometry condition RI$(21r,\nu)$ holds with some $0<\nu<\infty$
and with $0<\delta_{21r}<\delta_0$ for a sufficiently small $\delta_0$. Set

$\lambda=32\,\sigma\sqrt{\frac{c_1(m+T)}{nT^2}}.$

Let $\hat A$ be the Schatten-1 estimator with this parameter $\lambda$. Then with
probability at least $1-2\exp\{-(2-\log5)(m+T)\}$ we have

$d_{2,N}(\hat A,A^*)^2\le C_1\,c_1\sigma^2 r\nu^2\,\frac{m+T}{nT^2},$

$\|\hat A-A^*\|_{S_q}^q\le C_2\,c_1^{q/2}\sigma^q r\nu^{2q}\Bigl(\frac{m+T}{nT^2}\Bigr)^{q/2}\qquad\forall q\in[1,2],$

where $C_1$ is an absolute constant and $C_2$ depends only on $q$.

The proof of Corollary 4 is straightforward in view of Theorem 2, Lemma 2
and the fact that, under the premise of Corollary 4, we have
$|\mathcal L(A)|_2^2=T^{-1}\sum_{t=1}^T a_t'\Psi_t a_t\le(c_1/T)\|A\|_{S_2}^2$ for all matrices $A\in\mathbb R^{m\times T}$,
so that the sampling operator is uniformly bounded [(2.3) holds with $c_0=c_1/T$], and thus
$\phi_{\max}(1)\le\sqrt{c_1/T}$.

Taking in the bounds of Corollary 4 the natural scaling factor $\nu\sim\sqrt T$,
we obtain the following inequalities:

(6.1)  $d_{2,N}(\hat A,A^*)^2\le C_1\,\frac{r(m+T)}{nT},$

(6.2)  $\frac1T\,\|\hat A-A^*\|_{S_2}^2\le C_2\,\frac{r(m+T)}{nT},$

where the constants $C_1$ and $C_2$ do not depend on $m$, $T$ and $n$.

A remarkable fact is that the rates in Corollary 4 are free of logarithmic 
inflation factor. This is one of the differences between the matrix estimation 
problems and vector estimation ones, where the logarithmic risk inflation 
is inevitable, as first noticed by Donoho et al. (1992), Foster and George 
(1994). For more details about optimal rates of sparse estimation in the 
vector case, see Rigollet and Tsybakov (2010). 

Since the Group Lasso is a special case of the nuclear norm penalized
minimization on block-diagonal matrices [cf., e.g., Bach (2008)], Corollary 4 and
the bounds (6.1), (6.2) imply the corresponding bounds for the Group Lasso
under the low-rank assumption. To highlight the difference from the previous
results for the Group Lasso, we consider, for example, those obtained in the
multi-task setting by Lounici et al. (2009, 2010). The main difference is that
the sparsity index $s$ appearing in Lounici et al. (2009, 2010) is now replaced
by $r$. In Lounici et al. (2009, 2010), the columns $a_t^*$ of $A^*$ are supposed to
be sparse, with the sets of nonzero elements of cardinality not more than $s$,
whereas here the sparsity is characterized by the rank $r$ of $A^*$.

Finally, we give the following result based on application of Theorem 3. 



Corollary 5 (Multi-task learning; uniformly bounded $\mathcal L$). Let $\xi_1,\dots,\xi_N$
be i.i.d. $\mathcal N(0,\sigma^2)$ random variables, and assume that $n>e$. Consider
the multi-task learning problem with $A^*\in\mathbb R^{m\times T}$, $\operatorname{rank}(A^*)\le r$, such that
the maximal singular value $\sigma_1(A^*)\le n^{C^*}$ for some $0<C^*<\infty$. Assume
that the spectra of the Gram matrices $\Psi_t$ are uniformly in $t$ bounded from
above by $c_0T$ where $c_0<\infty$ is a constant. Set $p=(\log n)^{-1}$,
$c_\kappa=(2\kappa-1)(2\kappa)\kappa^{-1/(2\kappa-1)}$ where $\kappa=(2-p)/(2-2p)$ and

$\lambda=4c_\kappa\,(\vartheta/p)^{1-p/2}\Bigl(\frac{M}{nT}\Bigr)^{1-p/2}$

for some $\vartheta\ge C^2$ and a universal constant $C>0$, independent of $r$, $m$ and $n$.
Then the Schatten-$p$ estimator $\hat A$ with this parameter $\lambda$ satisfies

(6.3)  $d_{2,N}(\hat A,A^*)^2\le C_3\,\vartheta\,\frac{rM}{nT}\,\log n$

with probability at least $1-C\exp(-\vartheta M/C^2)$ where $M=\max(m,T)$, and the
positive constant $C_3$ is independent of $r$, $m$ and $n$.

Corollary 5 follows from Theorem 3. Indeed, it suffices to remark that,
under the premises of Corollary 5, we have
$|\mathcal L(A)|_2^2=T^{-1}\sum_{t=1}^T a_t'\Psi_t a_t\le c_0\|A\|_{S_2}^2$ for all matrices $A\in\mathbb R^{m\times T}$,
so that the sampling operator is uniformly bounded; cf. (2.3).

For $m=T$, we can write (6.3) in the form

(6.4)  $d_{2,N}(\hat A,A^*)^2\le C_3'\,\frac{rm}{nT}\,\log n.$

Clearly, this bound achieves the optimal rate "intrinsic dimension/sample
size" $\sim rm/N$, up to logarithms (recall that $N=nT$ in the multi-task
learning). The bounds (6.1) and (6.2) achieve this rate in a more precise sense
because they are free of extra logarithmic factors.

Another remark concerns the possible range of m. It follows from the 
discussion in Section 3 that the "dimension larger than the sample size" 
framework is not covered by Corollary 4 since this corollary relies on the 
RI condition. In contrast, the bounds of Corollary 5 make sense when the 
dimension $m$ is larger than the sample size $n$ of each task; we only need to
have $m\ll\exp(n)$ for Corollary 5 to be meaningful. Corollary 5 holds when
the RI assumption is violated and under a mild condition on the masks $X_i$.
The price to pay is to assume that the singular values of A* do not grow 
exponentially fast. Also, the estimator of Corollary 5 corresponds to p < 1, 
so it is computationally hard. 

7. Minimax lower bounds. In this section, we derive lower bounds for
the prediction error, which show that the upper bounds that we have proved
are optimal in a minimax sense for two scenarios: (i) under the RI condition
and (ii) for matrix completion with collaborative sampling. We also
provide a lower bound for USR matrix completion. Under the RI condition
with $\nu=1$, minimax lower bounds for the Frobenius norm $\|\hat A-A^*\|_{S_2}$ on
"Schatten-0" balls $\{A^*\in\mathbb R^{m\times T}:\operatorname{rank}(A^*)\le r\}$ are derived in Candes and
Plan (2010b) with a technique different from ours, which does not allow
one to include further boundedness constraints on $A^*$ in addition to the
constraint that it has rank at most $r$. Specifically, they prove their lower
bound by passage to Bayes risk with an unbounded support prior (Gaussian prior).
Our lower bounds are more general in the sense that they are obtained on smaller sets,
namely, the intersections of Schatten-0 and Schatten-$p$ balls. This is similar
in spirit to Rigollet and Tsybakov (2010) establishing minimax lower
bounds on the intersection of $\ell_0$ and $\ell_1$ balls for the vector sparsity
scenario. In what follows, we denote by $\inf_{\hat A}$ the infimum over all estimators
based on $(X_1,Y_1),\dots,(X_N,Y_N)$, and for any $A\in\mathbb R^{m\times T}$ we denote by $P_A$
the probability distribution of $(Y_1,\dots,Y_N)$ satisfying (1.1) with $A^*=A$.

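Theorem 5 (Lower bound; Restricted Isometry). Let $\xi_1,\dots,\xi_N$ be i.i.d.
$\mathcal N(0,\sigma^2)$ random variables for some $\sigma^2>0$. Let $M=\max(m,T)\ge8$, $r\ge1$,
$\min(T,m)\ge r$ and for $0<\alpha<1/8$ define

$\psi_{M,N,r,\Delta}=\min\Bigl\{\frac{rM}{N},\ \Delta^p\Bigl(\frac MN\Bigr)^{1-p/2},\ \Delta^2\Bigr\},$

$C(\alpha)=\frac{\alpha(1-\delta_r)^2\log2}{128\cdot2^{2/p}(1+\delta_r)^2}$  and  $C(\alpha,\nu)=\frac{\alpha\nu^2\log2}{128\cdot2^{2/p}(1+\delta_r)^2}.$

(i) Assume that the sampling operator $\mathcal L$ satisfies the right-hand side
inequality in the RI$(r,\nu)$-condition (2.4) for some $\delta_r\in(0,1)$. Then for any
$p\in(0,2]$, $\Delta>0$, $0<\alpha<1/8$,

(7.1)  $\inf_{\hat A}\ \sup_{A^*\in\mathbb R^{m\times T}:\ \operatorname{rank}(A^*)\le r,\ \|A^*\|_{S_p}\le\Delta\nu\sigma} P_{A^*}\bigl(\|\hat A-A^*\|_{S_2}^2>C(\alpha,\nu)\sigma^2\psi_{M,N,r,\Delta}\bigr)\ge\beta,$

where $\beta=\beta(M,\alpha)>0$ is such that $\beta(M,\alpha)\to1$ as $M\to\infty$, $\alpha\to0$.

(ii) Assume that the sampling operator $\mathcal L$ satisfies the RI$(r,\nu)$-condition
(2.4) for some $\delta_r\in(0,1)$. Then for any $p\in(0,2]$, $\Delta>0$, $0<\alpha<1/8$, with
$\beta$ as in (7.1),

(7.2)  $\inf_{\hat A}\ \sup_{A^*\in\mathbb R^{m\times T}:\ \operatorname{rank}(A^*)\le r,\ \|A^*\|_{S_p}\le\Delta\nu\sigma} P_{A^*}\bigl(d_{2,N}(\hat A,A^*)^2>C(\alpha)\sigma^2\psi_{M,N,r,\Delta}\bigr)\ge\beta.$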


Remark 3. It is worth noting that $C(\alpha)$ and $\beta(M,\alpha)$ do not depend
on the constant $\nu$ of the RI condition.

Proof of Theorem 5. Without loss of generality, we assume that
$M=m\ge T$. For a constant $\gamma>0$ and an integer $s\in\{1,2,\dots,r\}$, both to
be specified later, define

$\mathcal A_{s,\gamma}=\{A=(a_{ij})\in\mathbb R^{m\times T}:\ a_{ij}\in\{0,\gamma\nu/\sqrt N\}$ if $1\le j\le s$; $a_{ij}=0$ otherwise$\}.$

By construction, any element of $\mathcal A_{s,\gamma}$ as well as the difference of any two
elements of $\mathcal A_{s,\gamma}$ has rank at most $s$. Due to the Varshamov-Gilbert bound
[cf. Lemma 2.9 in Tsybakov (2009)], there exists a subset $\mathcal A^0_{s,\gamma}\subset\mathcal A_{s,\gamma}$ of
cardinality $\operatorname{Card}(\mathcal A^0_{s,\gamma})\ge2^{sm/8}$ containing $A_0=0$ such that for any two
distinct elements $A_1$ and $A_2$ of $\mathcal A^0_{s,\gamma}$,

(7.3)  $d_{2,N}(A_1,A_2)^2\ge\nu^{-2}(1-\delta_r)^2\|A_1-A_2\|_{S_2}^2\ge(1-\delta_r)^2\,\frac{\gamma^2sM}{8N},$

where the first inequality follows from the left-hand side inequality in the
RI condition (2.4) and is only used to prove (7.2). We will prove (7.2); the
proof of (7.1) is analogous in view of the second inequality in (7.3).

Then, for any $A_1\in\mathcal A^0_{s,\gamma}$, the Kullback-Leibler divergence $K(P_{A_0},P_{A_1})$
between $P_{A_0}$ and $P_{A_1}$ satisfies

(7.4)  $K(P_{A_0},P_{A_1})=\frac{N}{2\sigma^2}\,d_{2,N}(A_0,A_1)^2\le\frac{\gamma^2}{2\sigma^2}\,(1+\delta_r)^2sM,$

where we used again the RI condition. We now apply Theorem 2.5 in
Tsybakov (2009). Fix some $\alpha\in(0,1/8)$. Note that the condition

(7.5)  $\frac{1}{\operatorname{Card}(\mathcal A^0_{s,\gamma})-1}\sum_{A\in\mathcal A^0_{s,\gamma}}K(P_A,P_{A_0})\le\alpha\log\bigl(\operatorname{Card}(\mathcal A^0_{s,\gamma})-1\bigr)$

is satisfied for $\gamma^2\le\alpha\sigma^2(\log2)/(4(1+\delta_r)^2)$. Define

$r_\Delta=\arg\min\Bigl\{l\in\mathbb N:\ \Delta^p\le l\Bigl(\frac MN\Bigr)^{p/2}\Bigr\}$

and consider separately the following three cases.

The case $r_\Delta=1$. In this case, $\psi_{M,N,r,\Delta}=\Delta^2$ for any $r\ge1$, and $\Delta^2N/M\le1$.
Set

$s_1=1$  and  $\gamma_1=\Bigl(\frac{\alpha\log2}{4(1+\delta_r)^2}\,\frac{\sigma^2\Delta^2N}{M}\Bigr)^{1/2}.$

Then $\|A\|_{S_p}\le\|A\|_{S_2}\le\sqrt{M/N}\,\nu\gamma_1\le\Delta\nu\sigma$ for all $A\in\mathcal A_{1,\gamma_1}$, i.e., $\mathcal A_{1,\gamma_1}$ is
contained in the set

$\{A\in\mathbb R^{m\times T}:\ \operatorname{rank}(A)\le r,\ \|A\|_{S_p}\le\Delta\nu\sigma\}.$

Now, inequality (7.3) shows that $d_{2,N}(A_1,A_2)^2\ge4C(\alpha)\sigma^2\Delta^2$ for any two
distinct elements $A_1,A_2\in\mathcal A^0_{1,\gamma_1}$, while $\Delta^2N/M\le1$ implies that
$\gamma_1^2\le\alpha\sigma^2(\log2)/(4(1+\delta_r)^2)$. Hence, condition (7.5) is satisfied.

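The case $2\le r_\Delta\le r$. In this case, the rate $\psi_{M,N,r,\Delta}$ is equal to $\Delta^p(M/N)^{1-p/2}$.
We consider the set $\mathcal A^0_{r_\Delta,\gamma_2}$ with some $\gamma_2$ to be specified below. For
$A\in\mathcal A^0_{r_\Delta,\gamma_2}$, we have $\|A\|_{S_p}^2\le r_\Delta^{(2-p)/p}\|A\|_{S_2}^2\le r_\Delta^{2/p}\gamma_2^2\nu^2M/N$. Since also
$r_\Delta\le2\Delta^p(N/M)^{p/2}$ when $r_\Delta\ge2$, it follows that $\|A\|_{S_p}\le\Delta\nu\sigma$ whenever

(7.6)  $\gamma_2\le2^{-1/p}\sigma.$

Now define

$s_2=r_\Delta$  and  $\gamma_2=2^{-1/p}\Bigl(\frac{\alpha\sigma^2\log2}{4(1+\delta_r)^2}\Bigr)^{1/2}.$

Then (7.5) is satisfied and $\gamma_2$ fulfills also the constraint (7.6), since $\alpha<1/8$,
$(\log2)/4<1$. Thus, $\mathcal A^0_{r_\Delta,\gamma_2}$ is a subset of matrices $A\in\mathbb R^{m\times T}$ with
$\operatorname{rank}(A)\le r$ and $\|A\|_{S_p}\le\Delta\nu\sigma$. Finally, (7.3) implies that

$d_{2,N}(A_1,A_2)^2\ge(1-\delta_r)^2\,\frac{\gamma_2^2\,r_\Delta M}{8N}\ge(1-\delta_r)^2\,\frac{\gamma_2^2}{8}\,\Delta^p\Bigl(\frac MN\Bigr)^{1-p/2}=4C(\alpha)\sigma^2\Delta^p\Bigl(\frac MN\Bigr)^{1-p/2}$

for any two distinct elements $A_1,A_2$ of $\mathcal A^0_{r_\Delta,\gamma_2}$.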

The case $r_\Delta>r$. In this case, $\psi_{M,N,r,\Delta}=rM/N$. The conditions required
in Theorem 2.5 of Tsybakov (2009) follow immediately as above, this time
with the set of matrices $\mathcal A^0_{r,\gamma_3}$, where $\gamma_3^2=\alpha\sigma^2(\log2)/(4(1+\delta_r)^2)$. □

Remark 4. Theorem 5 implies that the rates of convergence in Theorem 4
are optimal in a minimax sense on Schatten-$p$ balls $\{A^*\in\mathbb R^{m\times T}:\ \|A^*\|_{S_p}\le\Delta\}$
under the RI condition and natural assumptions on $m$, $T$ and
$N$. Indeed, using Theorem 5 with no restriction on the rank [i.e., when
$r=\min(m,T)$], and putting for simplicity $\Delta=1$, we find that the rate in
the lower bound is of the order $\min(\min(m,T)M/N,\ (M/N)^{1-p/2},\ 1)$. For
$m=T\,(=M)$ and $m^3>N>m$ this minimum equals $(M/N)^{1-p/2}$, which
coincides with the upper bound of Theorem 4.
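To make the three regimes of the lower-bound rate concrete, here is a small sketch that evaluates $\psi_{M,N,r,\Delta}=\min\{rM/N,\ \Delta^p(M/N)^{1-p/2},\ \Delta^2\}$, the form taken by the rate in the three cases of the proof of Theorem 5. The function name and the numerical values are illustrative assumptions.

```python
def psi(M, N, r, Delta, p):
    """Lower-bound rate: minimum of the rank term r*M/N, the Schatten-p
    term Delta^p * (M/N)^(1 - p/2), and the trivial term Delta^2."""
    return min(r * M / N, Delta ** p * (M / N) ** (1 - p / 2), Delta ** 2)

if __name__ == "__main__":
    # Setting of Remark 4: m = T = M, no rank restriction (r = m), Delta = 1.
    m, N, p = 20, 1000, 0.5          # m < N < m**3
    print(psi(m, N, m, 1.0, p), (m / N) ** (1 - p / 2))   # the two values coincide
    # Without a Schatten-p constraint (Delta = infinity) the rank term is active.
    print(psi(m, N, 3, float("inf"), 1.0))
```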

The lower bound for the prediction error (7.2) in the above theorem does
not apply to matrix completion with $N<mT$ since then the Restricted
Isometry condition cannot be satisfied, as discussed in Section 3. However,
for the bound (7.1) we only need the right-hand side inequality in the RI
condition. For example, the latter is trivially satisfied for CS matrix completion
with $\nu=\sqrt N$ and $\delta_r=0$. This yields the following corollary.

Corollary 6 (Lower bound; CS matrix completion). Let $\xi_1,\dots,\xi_N$ be
i.i.d. $\mathcal N(0,\sigma^2)$ random variables for some $\sigma^2>0$. Let $M=\max(m,T)\ge8$,
$r\ge1$, $\min(T,m)\ge r$, and consider the problem of CS matrix completion.
Then for any $p\in(0,2]$, $\Delta>0$, $0<\alpha<1/8$,

$\inf_{\hat A}\ \sup_{A^*\in\mathbb R^{m\times T}:\ \operatorname{rank}(A^*)\le r,\ \|A^*\|_{S_p}\le\Delta\sqrt N\sigma} P_{A^*}\Bigl(\frac1N\|\hat A-A^*\|_{S_2}^2>C'(\alpha)\sigma^2\psi_{M,N,r,\Delta}\Bigr)\ge\beta,$

where $C'(\alpha)=\alpha(\log2)2^{-2/p}/128$ and $\beta=\beta(M,\alpha)$, $\psi_{M,N,r,\Delta}$ are as in
Theorem 5.



The model of uniform sampling without replacement considered in Candes 
and Recht (2009) is a particular case of CS matrix completion. In the noisy 
case, Keshavan, Montanari and Oh (2009) obtain upper bounds under such 
a sampling scheme with the rate rM/N, up to logarithmic factors. The lower 
bound of Corollary 6 is of the same order when $\Delta=\infty$, that is, for the class
of matrices of rank smaller than r. However, Keshavan, Montanari and Oh 
(2009) obtained their bounds on some subclasses of this class characterized 
by additional strong restrictions. 

It is useful to note that for bounds of the type (7.1) it is enough to have 
a condition on $\mathcal L$ in expectation, as specified in the next theorem.

Theorem 6. Let $\xi_1,\dots,\xi_N$ be i.i.d. $\mathcal N(0,\sigma^2)$ random variables for some
$\sigma^2>0$. Let $M=\max(m,T)\ge8$, $r\ge1$, $\min(T,m)\ge r$, and assume that
$X_1,\dots,X_N$ are random matrices independent of $\xi_1,\dots,\xi_N$, and the sampling
operator satisfies $\nu^2E|\mathcal L(A)|_2^2\le\|A\|_{S_2}^2$ for some $\nu>0$ and all $A\in\mathbb R^{m\times T}$ such
that $\operatorname{rank}(A)\le r$. Then for any $p\in(0,2]$, $\Delta>0$, $0<\alpha<1/8$,

$\inf_{\hat A}\ \sup_{A^*\in\mathbb R^{m\times T}:\ \operatorname{rank}(A^*)\le r,\ \|A^*\|_{S_p}\le\Delta\nu\sigma} P_{A^*}\Bigl(\frac1{\nu^2}\|\hat A-A^*\|_{S_2}^2>C'(\alpha)\sigma^2\psi_{M,N,r,\Delta}\Bigr)\ge\beta,$

where $C'(\alpha)=\alpha(\log2)2^{-2/p}/128$ and $\beta=\beta(M,\alpha)$, $\psi_{M,N,r,\Delta}$ are as in
Theorem 5.



Proof. We proceed as in Theorem 5, with the only difference in the
bound on the Kullback-Leibler divergence. Indeed, under our assumptions,
instead of (7.4) we have

(7.7)  $K(P_{A_0},P_{A_1})=\frac{N}{2\sigma^2}\,E\bigl(d_{2,N}(A_0,A_1)^2\bigr)\le\frac{N}{2\sigma^2\nu^2}\,\|A_0-A_1\|_{S_2}^2\le\frac{\gamma^2sM}{2\sigma^2}.$  □

Theorem 6 applies to USR matrix completion with $\nu=\sqrt{mT}$. Indeed,
in that case $mT\,E|\mathcal L(A)|_2^2=\|A\|_{S_2}^2$. In particular, Theorem 6 with $\Delta=\infty$
shows that on the class of matrices of rank smaller than $r$ the lower bound
of estimation in the squared Frobenius norm for USR matrix completion is
of the order $rM/N$.
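The identity $mT\,E|\mathcal L(A)|_2^2=\|A\|_{S_2}^2$ can be checked by exact enumeration. The sketch below assumes, as in the USR model, that each mask is drawn uniformly from the $mT$ matrices $e_ie_j'$, and uses $|\mathcal L(A)|_2^2=N^{-1}\sum_i\operatorname{tr}(X_i'A)^2$, so that its expectation equals $E\operatorname{tr}(X'A)^2$ for a single mask $X$. The function name is illustrative.

```python
import numpy as np

def usr_expected_sq_norm(A):
    """E tr(X'A)^2 for a mask X drawn uniformly from {e_i e_j'}: each of the
    m*T masks has probability 1/(m*T) and picks out one entry of A."""
    m, T = A.shape
    total = 0.0
    for i in range(m):
        for j in range(T):
            X = np.zeros((m, T))
            X[i, j] = 1.0                      # the mask e_i e_j'
            total += np.trace(X.T @ A) ** 2    # equals A[i, j] ** 2
    return total / (m * T)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    A = rng.standard_normal((4, 3))
    m, T = A.shape
    print(m * T * usr_expected_sq_norm(A), np.linalg.norm(A, "fro") ** 2)
```

Both printed values agree, which is the statement $\nu^2E|\mathcal L(A)|_2^2=\|A\|_{S_2}^2$ with $\nu=\sqrt{mT}$.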

The next theorem gives a lower bound for the prediction error under 
collaborative sampling without the RI condition. Instead, we only impose 
a rather natural condition that the observed noisy entries are sufficiently 
well dispersed, that is, there exist $r$ rows or $r$ columns with more than $\kappa Mr$
observations for some fixed $\kappa\in(0,1]$. We state the result with an additional
constraint on the Frobenius norm of A*, in order to fit the corresponding 
upper bound (cf. Remark 2 in Section 5). 

Theorem 7 (Lower bound; CS matrix completion). Let $\xi_1,\dots,\xi_N$ be
i.i.d. $\mathcal N(0,\sigma^2)$ random variables for some $\sigma^2>0$ and assume that the masks
$X_1=e_{i_1}(m)e_{j_1}'(T),\dots,X_N=e_{i_N}(m)e_{j_N}'(T)$ are pairwise different. Let
$\min(T,m)\ge r$ and $\kappa Mr\ge8$ for some fixed $\kappa\in(0,1]$, where $M=\max(m,T)$.
Assume furthermore that the following dispersion condition holds: there exist
numbers $1\le k_1<\dots<k_r\le T$ or $1\le k_1'<\dots<k_r'\le m$ such that either
the set $\{(i_1,j_1),\dots,(i_N,j_N)\}\cap\{(i,k_1),\dots,(i,k_r):\ i=1,\dots,m\}$ or the set
$\{(i_1,j_1),\dots,(i_N,j_N)\}\cap\{(k_1',j),\dots,(k_r',j):\ j=1,\dots,T\}$ has cardinality at
least $\kappa Mr+1$. Define $\mathcal C_{\delta,r}=\{A\in\mathbb R^{m\times T}:\ \operatorname{rank}(A)\le r$ and $\|A\|_{S_2}\le\delta\}$. Then
for any $0<\alpha<1/8$ and $\delta^2\ge\alpha\sigma^2(\log2)(\kappa Mr+1)/4$,

$\inf_{\hat A}\ \sup_{A^*\in\mathcal C_{\delta,r}} P_{A^*}\Bigl(d_{2,N}(\hat A,A^*)^2>C'(\alpha)\,\sigma^2\,\frac{\kappa Mr}{N}\Bigr)\ge\beta(\kappa M,\alpha)>0,$

with a function $\beta\to1$ as $\kappa M\to\infty$, $\alpha\to0$ and $C'(\alpha)=\alpha(\log2)/128$.



Proof. We proceed as in the case $\Delta=\infty$, $p=2$, $\nu=\sqrt N$ of Theorem 5,
taking a different set $\mathcal A^0$ instead of $\mathcal A^0_{s,\gamma}$. Let, for definiteness, the dispersion
condition be satisfied with the set of indices $\mathcal K=\{(i_1,j_1),\dots,(i_N,j_N)\}\cap
\{(i,k_1),\dots,(i,k_r):\ i=1,\dots,m\}$. Then there exists a subset $\mathcal K'$ of $\mathcal K$ with
cardinality $\operatorname{Card}(\mathcal K')=\lceil\kappa Mr\rceil$. We define

$\mathcal A=\{A=(a_{ij})\in\mathbb R^{m\times T}:\ a_{ij}\in\{0,\gamma\}$ if $(i,j)\in\mathcal K'$; $a_{ij}=0$ otherwise$\}.$

Any element of $\mathcal A$ as well as the difference of any two elements of $\mathcal A$ has rank
at most $r$, and $\|A\|_{S_2}^2\le\gamma^2\lceil\kappa Mr\rceil$ for all $A\in\mathcal A$. So, $\mathcal A\subset\mathcal C_{\delta,r}$ if $\gamma^2(\kappa Mr+1)\le\delta^2$.
As in Theorem 5, the Varshamov-Gilbert bound implies that there exists
a subset $\mathcal A^0\subset\mathcal A$ of cardinality $\operatorname{Card}(\mathcal A^0)\ge2^{\lceil\kappa Mr\rceil/8}$ containing $A_0=0$, such that
for any two distinct elements $A_1$ and $A_2$ of $\mathcal A^0$,

$d_{2,N}(A_1,A_2)^2=N^{-1}\|A_1-A_2\|_{S_2}^2\ge\frac{\gamma^2}{8}\,\frac{\lceil\kappa Mr\rceil}{N}.$

Instead of the bound (7.4), we have now the inequality
$K(P_{A_0},P_{A_1})\le\frac{\gamma^2}{2\sigma^2}\lceil\kappa Mr\rceil$ for any $A_1\in\mathcal A^0$. Finally, we choose $\gamma^2=\alpha\sigma^2(\log2)/4$. With these
modifications, the rest of the proof is the same as that of Theorem 5 in the
case $r_\Delta>r$. □

8. Control of the stochastic term. We consider two approaches for bounding
the stochastic term $N^{-1}\sum_{i=1}^N\xi_i\operatorname{tr}((\hat A-A^*)'X_i)$ on the right-hand side of
the basic inequality (3.1). The first one used for $p=1$ consists in application
of the trace duality

(8.1)  $\Bigl|\frac1N\sum_{i=1}^N\xi_i\operatorname{tr}\bigl((\hat A-A^*)'X_i\bigr)\Bigr|\le\|\hat A-A^*\|_{S_1}\,\|\mathbf M\|_{S_\infty}$

with $\mathbf M=N^{-1}\sum_{i=1}^N\xi_iX_i$ and then of suitable exponential bounds for the
spectral norm of $\mathbf M$ under different conditions on $X_i$, $i=1,\dots,N$. The
second approach used to treat the case $0<p<1$ (nonconvex penalties) (cf.
Section 8.2) is based on refined empirical process techniques. Proofs of the
results of this section are deferred to Section 10.
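The trace duality behind the first approach can be checked numerically. The sketch below uses Gaussian matrices $X_i$ and an arbitrary matrix `Delta` standing in for $\hat A-A^*$; both choices, and all function names, are illustrative assumptions rather than the sampling schemes of the paper.

```python
import numpy as np

def noise_matrix(xi, X):
    """M = N^{-1} sum_i xi_i X_i for xi of shape (N,) and X of shape (N, m, T)."""
    return np.tensordot(xi, X, axes=1) / len(xi)

def stochastic_term(xi, X, Delta):
    """N^{-1} sum_i xi_i tr(Delta' X_i), computed term by term."""
    return float(np.mean([x * np.trace(Delta.T @ Xi) for x, Xi in zip(xi, X)]))

def duality_bound(xi, X, Delta):
    """Right-hand side of the trace duality: ||Delta||_{S_1} * ||M||_{S_inf}."""
    M = noise_matrix(xi, X)
    return float(np.linalg.norm(Delta, "nuc") * np.linalg.norm(M, 2))

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    N, m, T = 40, 5, 4
    X = rng.standard_normal((N, m, T))
    xi = rng.standard_normal(N)
    Delta = rng.standard_normal((m, T))
    print(abs(stochastic_term(xi, X, Delta)), "<=", duality_bound(xi, X, Delta))
```

The stochastic term equals $\operatorname{tr}(\Delta'\mathbf M)$, so everything reduces to controlling the spectral norm of $\mathbf M$, which is what the lemmas of Section 8.1 do.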

8.1. Tail bounds for the spectral norm of random matrices. We say that
the random variables $\xi_i$, $i=1,\dots,N$, satisfy the Bernstein condition if

(8.2)  $\max_{1\le i\le N}E|\xi_i|^l\le\frac12\,l!\,\sigma^2H^{l-2},\qquad l=2,3,\dots,$

with some finite constants $\sigma$ and $H$, and we say that they satisfy the light
tail condition if

(8.3)  $\max_{1\le i\le N}E\bigl(\exp(\xi_i^2/\sigma^2)\bigr)\le\exp(1)$

for some positive constant $\sigma^2$.

Lemma 1. Let the i.i.d. zero-mean random variables $\xi_i$ satisfy the
Bernstein condition (8.2). Let also either the conditions

(8.4)  $\max_{1\le j\le m}\frac1N\sum_{i=1}^N|X_{i(j,\cdot)}|_2^2\le S_{\rm row}^2$

and

(8.5)  $\max_{1\le j\le m,\ 1\le i\le N}|X_{i(j,\cdot)}|_2\le H_{\rm row}$

or the conditions

(8.6)  $\max_{1\le k\le T}\frac1N\sum_{i=1}^N|X_{i(\cdot,k)}|_2^2\le S_{\rm col}^2$

and

(8.7)  $\max_{1\le k\le T,\ 1\le i\le N}|X_{i(\cdot,k)}|_2\le H_{\rm col}$

hold true with some constants $S_{\rm row}$, $H_{\rm row}$, $S_{\rm col}$, $H_{\rm col}$. Let $D>1$. Then,
respectively, with probability at least $1-2/m^{D-1}$ or at least $1-2/T^{D-1}$ we
have

(8.8)  $\|\mathbf M\|_{S_\infty}\le\tau,$

where $\tau=\tau_{\rm row}=C_{\rm row}\sqrt{m(\log m)/N}$ if (8.4) and (8.5) are satisfied or
$\tau=\tau_{\rm col}=C_{\rm col}\sqrt{T(\log T)/N}$ if (8.6) and (8.7) hold. Here

$C_{\rm row}=\sqrt{2D\sigma^2S_{\rm row}^2}+2DH_{\rm row}H\sqrt{\frac{\log m}{N}},$

$C_{\rm col}=\sqrt{2D\sigma^2S_{\rm col}^2}+2DH_{\rm col}H\sqrt{\frac{\log T}{N}}.$

Lemma 2. Let $\xi_1,\dots,\xi_N$ be i.i.d. $\mathcal N(0,\sigma^2)$ random variables. Then, for
any $D\ge2$,

(8.9)  $\|\mathbf M\|_{S_\infty}\le4\sqrt{2D}\,\sigma\phi_{\max}(1)\sqrt{\frac{m+T}{N}}=:\tau_1$

with probability at least $1-2\exp\{-(D-\log5)(m+T)\}$, where $\phi_{\max}(1)$ is
the maximal rank 1 eigenvalue of the sampling operator $\mathcal L$.



If $m$ and $T$ have the same order of magnitude, the bound of Lemma 2 is
better, since it does not contain extra logarithmic factors. On the other hand,
if $m$ and $T$ differ dramatically, for example, $m\gg T$, then Lemma 1 can
provide a significant improvement. Indeed, the "column" version of Lemma 1
guarantees the rate $\tau\sim\sqrt{T\log T}/\sqrt N$ which in this case is much smaller
than $\sqrt{m/N}$. In all the cases, the concentration rate in Lemma 2 is
exponential and thus faster than in Lemma 1.

The next lemma treats the stochastic term for USR matrix completion. 

Lemma 3 (USR matrix completion). (i) Let the i.i.d. zero-mean random
variables $\xi_i$ satisfy the Bernstein condition (8.2). Consider the USR matrix
completion problem and assume that $mT(m+T)>N$. Then, for any $D\ge2$,

(8.10)  $\|\mathbf M\|_{S_\infty}\le(4\sigma\sqrt{10D}+8HD)\,\frac{m+T}{N}=:\tau_2$

with probability at least $1-4\exp\{-(2-\log5)(m+T)\}$.

(ii) Assume that the i.i.d. zero-mean random variables $\xi_i$ satisfy the light
tail condition (8.3) for some $\sigma^2>0$. Then for any $B>0$,

-a log(max(m + 1, T + 1)) 



$.11 



M 



N 



with probability at least $1-(1/C)\max(m+1,T+1)^{-CB}$ for some constant
$C>0$ which does not depend on $m$, $T$ and $N$.

The proof of part (i) is based on a refinement of a technique in Vershynin
(2007), whereas that of part (ii) follows immediately from the large
deviations inequality of Nemirovski (2004). For example, if $\xi_i\sim\mathcal N(0,\sigma^2)$, in which
case both results apply, the bound (ii) is tighter than (i) for sample sizes
$N\ll(m+T)^2$, which is the most interesting case for matrix completion.

Much tighter bounds are available when the $X_i$ are constrained to be
pairwise different. Besides, it is noteworthy that the rates in (8.12) and (8.13)
below are different for Gaussian and Bernstein errors.

Lemma 4 (Collaborative sampling). Consider the problem of CS matrix
completion.

(i) Let $\xi_1,\dots,\xi_N$ be i.i.d. $\mathcal N(0,\sigma^2)$ random variables. Then, for any
$D\ge2$,

(8.12)  $\|\mathbf M\|_{S_\infty}\le8\sigma\sqrt D\,\frac{\sqrt{m+T}}{N}=:\tau_4$

with probability at least $1-2\exp\{-(D-\log5)(m+T)\}$.

(ii) Let $\xi_1,\dots,\xi_N$ be i.i.d. zero-mean random variables satisfying the
Bernstein condition (8.2). Then, for any $D\ge2$,

(8.13)  $\|\mathbf M\|_{S_\infty}\le\frac{4\sigma\sqrt{2D(m+T)}+8HD(m+T)}{N}=:\tau_5$

with probability at least $1-2\exp\{-(D-\log5)(m+T)\}$.

(iii) Let $\xi_1,\dots,\xi_N$ be i.i.d. $\mathcal N(0,\sigma^2)$ random variables. Then for any $A>0$,

$\|\mathbf M\|_{S_\infty}\le\frac{\sigma\sqrt{2A\log(m+T)}}{N}\,\max\Bigl\{\Bigl\|\sum_{i=1}^NX_iX_i'\Bigr\|_{S_\infty}^{1/2},\ \Bigl\|\sum_{i=1}^NX_i'X_i\Bigr\|_{S_\infty}^{1/2}\Bigr\}=:\tau_6$

with probability at least $1-2(m+T)^{1-A}$.

Since the masks $X_i$ are distinct, the maximum appearing in (iii) is bounded
by $\sqrt{\max(m,T)}$; in case it is attained, the bound (8.12) is slightly stronger
since it is free from the logarithmic factor. For $N\ll mT$ the tightness of the
bound in (iii) depends strongly on the geometry of the $X_i$'s and the
maximum can be significantly smaller than $\sqrt{\max(m,T)}$. Note also that the
concentration in (8.12) is exponential, while it is only polynomial in (iii).

8.2. Concentration bounds for the stochastic term under nonconvex penalties.
The last bound in this section applies in the case $0<p<1$. It is given
in the following lemma.

Lemma 5. Let $\xi_1,\dots,\xi_N$ be i.i.d. $\mathcal N(0,\sigma^2)$ random variables, $0<p<1$
and $M=\max(m,T)$. Assume that the sampling operator $\mathcal L$ is uniformly
bounded; cf. (2.3). Set $c_\kappa=(2\kappa-1)(2\kappa)\kappa^{-1/(2\kappa-1)}$ where $\kappa=(2-p)/(2-2p)$.
Then for any fixed $\delta>0$, $\vartheta\ge C^2$ and $\tau_7=c_\kappa(\vartheta/p)^{1-p/2}(M/N)^{1-p/2}$ we
have

(8.14)  $\Bigl|\frac1N\sum_{i=1}^N\xi_i\operatorname{tr}\bigl((\hat A-A^*)'X_i\bigr)\Bigr|\le\frac\delta2\,d_{2,N}(\hat A,A^*)^2+\tau_7\,\delta^{p-1}\,\|\hat A-A^*\|_{S_p}^p$

with probability at least $1-C\exp(-\vartheta M/C^2)$ for some constant $C=C(p,c_0,\sigma^2)>0$
which is independent of $M$ and $N$ and satisfies $\sup_{0<p\le q}C(p,c_0,\sigma)<\infty$
for all $q<1$.

Note at this point that we cannot base the proof of Lemma 5 directly on
the trace duality and norm interpolation (cf. Lemma 11), that is, on the
inequalities

(8.15)  $\Bigl|\frac1N\sum_{i=1}^N\xi_i\operatorname{tr}\bigl(X_i'(\hat A-A^*)\bigr)\Bigr|\le\|\hat A-A^*\|_{S_1}\|\mathbf M\|_{S_\infty}\le\|\hat A-A^*\|_{S_p}^{p/(2-p)}\,\|\hat A-A^*\|_{S_2}^{(2-2p)/(2-p)}\,\|\mathbf M\|_{S_\infty}.$

Indeed, one may think that we could have bounded here the $S_\infty$-norm of $\mathbf M$
in the same way as in Section 8.1, and then the proof would be complete
after suitable decoupling if we were able to bound from above $\|\hat A-A^*\|_{S_2}^2$
by $d_{2,N}(\hat A,A^*)^2$ times a constant factor. However, this is not possible. Even
the Restricted Isometry condition cannot help here because $\hat A-A^*$ is not
necessarily of small rank. Nevertheless, we will show that by other techniques
it is possible to derive an inequality similar to (8.15) with $d_{2,N}(\hat A,A^*)$ instead
of $\|\hat A-A^*\|_{S_2}$. Further details are given in Sections 10 and 11.

9. Proof of Theorem 2. 

Preliminaries. We first give two lemmas on matrix decomposition needed
in our proof, which are essentially provided by Recht, Fazel and Parrilo
(2010) [subsequently, RFP(10) for short].

Lemma 6. Let $A$ and $B$ be matrices of the same dimension. If $AB'=0$,
$A'B=0$, then

$\|A+B\|_{S_p}^p=\|A\|_{S_p}^p+\|B\|_{S_p}^p\qquad\forall p>0.$

Proof. For $p=1$ the result is Lemma 2.3 in RFP(10). The argument
obviously extends to any $p>0$ since RFP(10) show that the singular values
of $A+B$ are equal to the union (with repetition) of the singular values of
$A$ and $B$. □
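Lemma 6 is easy to check numerically. The sketch below builds a pair $A$, $B$ with mutually orthogonal column spaces and mutually orthogonal row spaces, so that $AB'=0$ and $A'B=0$, and compares both sides for several $p$, including $p<1$. The construction and the function names are illustrative assumptions.

```python
import numpy as np

def schatten_p_power(A, p, tol=1e-10):
    """||A||_{S_p}^p = sum of p-th powers of the nonzero singular values."""
    s = np.linalg.svd(A, compute_uv=False)
    s = s[s > tol]          # drop numerically zero singular values
    return float((s ** p).sum())

def orthogonal_pair(m, T, rA, rB, seed=0):
    """Matrices A (rank rA) and B (rank rB) with A B' = 0 and A' B = 0.
    Requires rA + rB <= min(m, T)."""
    rng = np.random.default_rng(seed)
    U, _ = np.linalg.qr(rng.standard_normal((m, rA + rB)))   # orthonormal columns
    V, _ = np.linalg.qr(rng.standard_normal((T, rA + rB)))
    dA = rng.uniform(0.5, 2.0, rA)
    dB = rng.uniform(0.5, 2.0, rB)
    A = U[:, :rA] @ np.diag(dA) @ V[:, :rA].T
    B = U[:, rA:] @ np.diag(dB) @ V[:, rA:].T
    return A, B

if __name__ == "__main__":
    A, B = orthogonal_pair(6, 5, 2, 2)
    for p in (0.3, 1.0, 2.0):
        lhs = schatten_p_power(A + B, p)
        rhs = schatten_p_power(A, p) + schatten_p_power(B, p)
        print(p, lhs, rhs)   # lhs and rhs agree up to rounding
```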

Lemma 7. Let $A\in\mathbb R^{m\times T}$ with $\operatorname{rank}(A)=r$ and singular value
decomposition $A=U\Lambda V'$. Let $B\in\mathbb R^{m\times T}$ be arbitrary. Then there exists a
decomposition $B=B_1+B_2$ with the following properties:

(i) $\operatorname{rank}(B_1)\le2\operatorname{rank}(A)=2r$,

(ii) $AB_2'=0$, $A'B_2=0$,

(iii) $\operatorname{tr}(B_1'B_2)=0$,

(iv) $B_1$ and $B_2$ are of the form

$B_1=U\begin{pmatrix}B_{11}&B_{12}\\B_{21}&0\end{pmatrix}V'$  and  $B_2=U\begin{pmatrix}0&0\\0&B_{22}\end{pmatrix}V'$

with $B_{11}\in\mathbb R^{r\times r}$.

The points (i)-(iii) are the statement of Lemma 3.4 in RFP(10), the
representation (iv) is provided in its proof.
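The decomposition in Lemma 7 can be written out directly from representation (iv): express $B$ in the singular bases of $A$, keep the lower-right block as $B_2$, and put the rest into $B_1$. A minimal sketch with an illustrative function name, using the full singular value decomposition so that $U$ and $V$ are square:

```python
import numpy as np

def lemma7_split(A, B, r):
    """Split B = B1 + B2 relative to a rank-r matrix A, following (iv):
    in the singular bases of A, B2 keeps only the lower-right block."""
    U, _, Vt = np.linalg.svd(A, full_matrices=True)   # U is m x m, Vt is T x T
    B_tilde = U.T @ B @ Vt.T                          # B in the bases of A
    B2_tilde = np.zeros_like(B_tilde)
    B2_tilde[r:, r:] = B_tilde[r:, r:]                # block B_22
    B1_tilde = B_tilde - B2_tilde                     # blocks B_11, B_12, B_21
    return U @ B1_tilde @ Vt, U @ B2_tilde @ Vt

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    m, T, r = 7, 6, 2
    A = rng.standard_normal((m, r)) @ rng.standard_normal((r, T))   # rank r
    B = rng.standard_normal((m, T))
    B1, B2 = lemma7_split(A, B, r)
    print(np.linalg.matrix_rank(B1))        # at most 2r
    print(np.abs(A @ B2.T).max(), np.abs(A.T @ B2).max())   # both numerically zero
    print(np.trace(B1.T @ B2))              # numerically zero
```

In the proof of Theorem 2 this split is applied with $A=A^*$ and $B=\hat A-A^*$.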

Proof of Theorem 2. First note that there exists a decomposition
$\hat A=\hat A^{(1)}+\hat A^{(2)}$ with the following properties:

(i) $\operatorname{rank}(\hat A^{(1)}-A^*)\le2\operatorname{rank}(A^*)=2r$,

(ii) $A^*(\hat A^{(2)})'=0$, $(A^*)'\hat A^{(2)}=0$,

(iii) $\operatorname{tr}((\hat A^{(1)}-A^*)'\hat A^{(2)})=0$.

This follows from Lemma 7 with $A=A^*$ and $B=\hat A-A^*$. In the notation
of Lemma 7, we have $B_1=\hat A^{(1)}-A^*$ and $B_2=\hat A^{(2)}$.

From the basic inequalities (3.1) and (3.2) with $\delta=1/2$, we find

(9.1)  $(1-I_{\{0<p<1\}}/2)\,d_{2,N}(\hat A,A^*)^2\le2^{2-p}\tau\|\hat A-A^*\|_{S_p}^p+4\tau\bigl(\|A^*\|_{S_p}^p-\|\hat A\|_{S_p}^p\bigr).$

In particular, for the case $p=1$,

(9.2)  $d_{2,N}(\hat A,A^*)^2\le2\tau\|\hat A-A^*\|_{S_p}^p+4\tau\bigl(\|A^*\|_{S_p}^p-\|\hat A\|_{S_p}^p\bigr).$

For brevity, we will conduct the proof with the numerical constants given
in (9.2), that is, with those for $p=1$. The proof for general $p$ differs only in
the values of the constants, but their expressions become cumbersome.
Using (2.1), we get

(9.3)  $d_{2,N}(\hat A,A^*)^2\le2\tau\|\hat A^{(1)}-A^*\|_{S_p}^p+4\tau\|A^*\|_{S_p}^p+2\tau\|\hat A^{(2)}\|_{S_p}^p-4\tau\|\hat A\|_{S_p}^p.$

By (2.1) again and by Lemma 6,

$\|\hat A\|_{S_p}^p\ge\|A^*+\hat A^{(2)}\|_{S_p}^p-\|\hat A^{(1)}-A^*\|_{S_p}^p=\|A^*\|_{S_p}^p+\|\hat A^{(2)}\|_{S_p}^p-\|\hat A^{(1)}-A^*\|_{S_p}^p,$

since $(A^*)'\hat A^{(2)}=0$ and $A^*(\hat A^{(2)})'=0$ by construction. Together with (9.3)
this yields

(9.4)  $d_{2,N}(\hat A,A^*)^2\le2\tau\|\hat A^{(1)}-A^*\|_{S_p}^p-2\tau\|\hat A^{(2)}\|_{S_p}^p+4\tau\|\hat A^{(1)}-A^*\|_{S_p}^p,$

from which one may deduce

(9.5)  $d_{2,N}(\hat A,A^*)^2\le6\tau\|\hat A^{(1)}-A^*\|_{S_p}^p$

and

(9.6)  $\|\hat A^{(2)}\|_{S_p}^p\le3\|\hat A^{(1)}-A^*\|_{S_p}^p.$

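Consider now the following decomposition of the matrix $\hat A^{(2)}$. First, recall
that $\hat A^{(2)}$ is of the form

$\hat A^{(2)}=U\begin{pmatrix}0&0\\0&B_{22}\end{pmatrix}V'.$

Write $B_{22}=W_1\Lambda(B_{22})W_2'$ with diagonal matrix $\Lambda(B_{22})$ of dimension $r'$ and
$W_1'W_1=W_2'W_2=I_{r'\times r'}$ for some $r'\le\min(m,T)$. In the next step, $W_1$ and
$W_2$ are complemented to orthogonal matrices $\tilde W_1$ and $\tilde W_2$ of dimension
$\min(m,T)\times\min(m,T)$. For instance, set

$\tilde W_2=\Bigl(\ \ast\ \ \begin{matrix}0\\W_2\end{matrix}\Bigr),$

where $\ast$ complements the columns of the matrix $\binom{0}{W_2}$ to an orthonormal
basis in $\mathbb R^{\min(m,T)}$, and proceed analogously with $\tilde W_1$. In particular,
$\tilde W_1'\tilde W_1=\tilde W_2'\tilde W_2=I_{\min(m,T)\times\min(m,T)}$. Also

$\begin{pmatrix}0&0\\0&W_1\Lambda(B_{22})W_2'\end{pmatrix}=\tilde W_1\begin{pmatrix}0&0\\0&\Lambda(B_{22})\end{pmatrix}\tilde W_2'.$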
We now represent A^ as a finite sum of matrices A^ = Ylf=i Aj with 

Af ] = UWiDiW^V' 



and 

Di 





A; I ' 
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where the r' x r' diagonal matrix Aj has the form Aj = diag(Aj/{j g /.}), i > 1. 
We denote here by I\ the set of ar indices from {1, . . . , min(ra, T)} corre- 
sponding to the ar largest in absolute value diagonal entries of A, by I2 

the set of indices corresponding to the next ar largest in absolute value di- 

~ (2) 

agonal entries Aj, etc. Clearly, the matrices A) are mutually orthogonal: 

"(2) ~(2) "(2) "(2} 

ti((Aj )' A k ) = for j ^ k and rank(A!- ) < ar. Moreover, A\ is orthog- 
onal to - A*. 

Let <Ti > o"2 > • • • be the singular values of A^ , then a± > ■ • • > a ar are the 
singular values of A\ , a ar +i > • • • > &2ar those of A 2 , etc. By construction, 
we have Card(/j) = ar for all i, and for all k G ij+i 

l/p 



Cfc < min 0",- < 



ar ^ 3 

jeh 7 



Thus, 

2/p 



E «2^£«j)' 



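from which one can deduce for all $j \ge 2$:

$$\|\hat A_j^{(2)}\|_{S_2} = \Big(\sum_{k\in I_j}\sigma_k^2\Big)^{1/2} \le (ar)^{1/2-1/p}\Big(\sum_{k\in I_{j-1}}\sigma_k^p\Big)^{1/p} = (ar)^{1/2-1/p}\|\hat A_{j-1}^{(2)}\|_{S_p}$$

and consequently

$$\sum_{j\ge 2}\|\hat A_j^{(2)}\|_{S_2} \le (ar)^{1/2-1/p}\sum_{j\ge 2}\|\hat A_{j-1}^{(2)}\|_{S_p}.$$

Because of the elementary inequality $x^{1/p} + y^{1/p} \le (x + y)^{1/p}$ for any nonnegative
$x, y$ and $0 < p < 1$,

$$\sum_{j\ge 2}\|\hat A_{j-1}^{(2)}\|_{S_p} = \sum_{j\ge 2}\Big(\sum_{k\in I_{j-1}}\sigma_k^p\Big)^{1/p} \le \Big(\sum_{j\ge 2}\sum_{k\in I_{j-1}}\sigma_k^p\Big)^{1/p} \le \Big(\sum_{k}\sigma_k^p\Big)^{1/p} = \|\hat A^{(2)}\|_{S_p}.$$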
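The block ("shelling") argument above can be sanity-checked numerically. The following is a minimal sketch, not part of the proof: it assumes an arbitrary nonnegative vector of singular values, and the names `shelling_bound_holds` and `block` (playing the role of $ar$) are illustrative. It sorts the values, cuts them into consecutive blocks, and compares the sum of the Euclidean norms of blocks $2, 3, \ldots$ with $(ar)^{1/2-1/p}$ times the $\ell_p$ quasi-norm of the whole vector.

```python
import numpy as np

def shelling_bound_holds(sigma, block, p):
    """Compare sum_{j>=2} ||block_j||_2 with block^(1/2-1/p) * ||sigma||_p.

    sigma : nonnegative values (stand-in for singular values)
    block : block size (plays the role of a*r in the text)
    p     : quasi-norm index, 0 < p < 1
    """
    s = np.sort(np.abs(np.asarray(sigma, dtype=float)))[::-1]
    # Consecutive blocks I_1, I_2, ... of the sorted values.
    blocks = [s[i:i + block] for i in range(0, len(s), block)]
    lhs = sum(np.linalg.norm(b) for b in blocks[1:])
    rhs = block ** (0.5 - 1.0 / p) * np.sum(s ** p) ** (1.0 / p)
    return lhs <= rhs * (1 + 1e-12), lhs, rhs

rng = np.random.default_rng(0)
results = [shelling_bound_holds(rng.exponential(size=40), block=5, p=p)
           for p in (0.2, 0.5, 0.9)]
```

Each entry of `results` holds a flag together with both sides of the inequality, so the slack can be inspected for different values of `p`.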



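Therefore,

$$\begin{aligned}
\sum_{j\ge 2}\|\hat A_j^{(2)}\|_{S_2} &\le (ar)^{1/2-1/p}\|\hat A^{(2)}\|_{S_p}\\
&\le 3^{1/p}(ar)^{1/2-1/p}\|\hat A^{(1)} - A^*\|_{S_p} \quad\text{[using inequality (9.6)]}\\
&\le 3^{1/p}(ar)^{1/2-1/p}(2r)^{1/p-1/2}\|\hat A^{(1)} - A^*\|_{S_2},
\end{aligned}$$

where the last inequality results from $\mathrm{rank}(\hat A^{(1)} - A^*) \le 2r$ and

$$\Big(\sum_{k\le 2r}a_k^p\Big)^{1/p} \le (2r)^{1/p-1/2}\Big(\sum_{k\le 2r}a_k^2\Big)^{1/2}$$

for nonnegative $a_k$. Finally,

$$\sum_{j\ge 2}\|\hat A_j^{(2)}\|_{S_2} \le 3^{1/p}\Big(\frac{a}{2}\Big)^{1/2-1/p}\|\hat A^{(1)} - A^*\|_{S_2}. \tag{9.7}$$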

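We now proceed with the final argument. First, note that $\mathrm{rank}((\hat A^{(1)} - A^*) + \hat A_1^{(2)}) \le (2 + a)r$.
Next, by the triangular inequality, the restricted isometry
condition and the orthogonality of $\hat A_1^{(2)}$ and $\hat A^{(1)} - A^*$ we obtain

$$\begin{aligned}
\nu\, d_{2,N}(\hat A, A^*) &= \nu|\mathcal{L}(\hat A - A^*)|_2\\
&\ge \nu|\mathcal{L}(\hat A^{(1)} - A^* + \hat A_1^{(2)})|_2 - \nu\sum_{i\ge 2}|\mathcal{L}(\hat A_i^{(2)})|_2\\
&\ge (1 - \delta_{(2+a)r})\|\hat A^{(1)} - A^* + \hat A_1^{(2)}\|_{S_2} - (1 + \delta_{ar})\sum_{i\ge 2}\|\hat A_i^{(2)}\|_{S_2}\\
&\ge \|\hat A^{(1)} - A^*\|_{S_2}\Big((1 - \delta_{(2+a)r}) - (1 + \delta_{ar})3^{1/p}\Big(\frac{a}{2}\Big)^{1/2-1/p}\Big).
\end{aligned} \tag{9.8}$$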



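Define

$$a = a(p) = \min\{k \in \mathbb{N}: k > (6^{1/p}/\sqrt{2})^{2p/(2-p)}\}.$$
Then $1 - 3^{1/p}(a/2)^{1/2-1/p} > 0$. Now, $\delta_{(2+a)r} \ge \delta_{ar}$, and thus

$$(1 - \delta_{(2+a)r}) - (1 + \delta_{ar})3^{1/p}\Big(\frac{a}{2}\Big)^{1/2-1/p} \ge \Big(1 - 3^{1/p}\Big(\frac{a}{2}\Big)^{1/2-1/p}\Big) - 2\delta_{(2+a)r} > 0$$

whenever

$$\delta_{(2+a)r} < \frac{1}{2}\Big(1 - 3^{1/p}\Big(\frac{a}{2}\Big)^{1/2-1/p}\Big). \tag{9.9}$$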
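To get a feeling for how restrictive condition (9.9) is, the integer $a(p)$ and the resulting threshold on the isometry constant can be evaluated directly. This is a small sketch under the reading of the definition given above; the function names are illustrative. It uses the algebraically equivalent form $(6^{1/p}/\sqrt 2)^{2p/(2-p)} = 2\cdot 3^{2/(2-p)}$, which avoids overflow for small $p$.

```python
import math

def a_of_p(p):
    """Smallest integer k with k > (6^(1/p)/sqrt(2))^(2p/(2-p)) = 2 * 3^(2/(2-p))."""
    bound = 2.0 * 3.0 ** (2.0 / (2.0 - p))
    return math.floor(bound) + 1

def rip_threshold(p):
    """Right-hand side of (9.9): 0.5 * (1 - 3^(1/p) * (a/2)^(1/2 - 1/p))."""
    a = a_of_p(p)
    # Work on the log scale so that 3^(1/p) does not overflow for small p.
    log_term = math.log(3.0) / p + (0.5 - 1.0 / p) * math.log(a / 2.0)
    return 0.5 * (1.0 - math.exp(log_term))

thresholds = {p: (a_of_p(p), rip_threshold(p)) for p in (0.1, 0.5, 0.9)}
```

The threshold is positive by construction of $a(p)$ but can be small, which illustrates that (9.9) constrains the isometry constant at the level $(2+a)r$ rather tightly.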



In case of (9.9), there exists a universal constant $\kappa = \kappa(p)$ such that

$$\nu^2 d_{2,N}(\hat A, A^*)^2 \ge \kappa\|\hat A^{(1)} - A^*\|_{S_2}^2. \tag{9.10}$$

Now, inequalities (9.5) and (9.10) yield

$$\kappa\|\hat A^{(1)} - A^*\|_{S_2}^2 \le 6\tau\nu^2\|\hat A^{(1)} - A^*\|_{S_p}^p \le 6\tau\nu^2(2r)^{1-p/2}\|\hat A^{(1)} - A^*\|_{S_2}^p, \tag{9.11}$$
where the second inequality results from the fact that $\mathrm{rank}(\hat A^{(1)} - A^*) \le 2r$,
which implies

$$\|\hat A^{(1)} - A^*\|_{S_p} \le (2r)^{1/p-1/2}\|\hat A^{(1)} - A^*\|_{S_2}. \tag{9.12}$$
From (9.11), we obtain

$$\kappa\|\hat A^{(1)} - A^*\|_{S_2}^{2-p} \le 6\tau\nu^2(2r)^{1-p/2}. \tag{9.13}$$
Furthermore, from (9.5), (9.12) and (9.13) we find

$$d_{2,N}(\hat A, A^*)^2 \le 6\tau(2r)^{1-p/2}\|\hat A^{(1)} - A^*\|_{S_2}^p \le 2r(6\tau)^{2/(2-p)}\kappa^{-p/(2-p)}\nu^{2p/(2-p)}. \tag{9.14}$$

This proves (3.4). It remains to prove (3.5). We first demonstrate (3.5) for
$q = 2$, then for $q = p$, and finally obtain (3.5) for all $q \in [p, 2]$ by Schatten
norm interpolation.

Using (9.7), (9.8), (9.14), we find

$$(1 - \delta_{(2+a)r})\|\hat A^{(1)} - A^* + \hat A_1^{(2)}\|_{S_2} \le \nu\, d_{2,N}(\hat A, A^*) + (1 + \delta_{ar})\sum_{i\ge 2}\|\hat A_i^{(2)}\|_{S_2} \le C\sqrt{r}\,\tau^{1/(2-p)}\nu^{2/(2-p)}$$
for some constant $C = C(p) > 0$. This and again (9.7) yield

$$\|\hat A - A^*\|_{S_2} \le \|\hat A^{(1)} - A^* + \hat A_1^{(2)}\|_{S_2} + \sum_{i\ge 2}\|\hat A_i^{(2)}\|_{S_2} \le C'\sqrt{r}\,\tau^{1/(2-p)}\nu^{2/(2-p)}$$

for some constant $C' = C'(p) > 0$. Thus, we have proved (3.5) for $q = 2$.
Next, using inequalities (2.1) and (9.6) we obtain

$$\|\hat A - A^*\|_{S_p}^p \le \|\hat A^{(1)} - A^*\|_{S_p}^p + \|\hat A^{(2)}\|_{S_p}^p \le 4\|\hat A^{(1)} - A^*\|_{S_p}^p.$$

Combining this with (9.12) and (9.13) we get (3.5) for $q = p$. Finally, (3.5)
for arbitrary $q \in [p, 2]$ follows from the norm interpolation formula

$$\|A\|_{S_q}^q \le \|A\|_{S_p}^{p(2-q)/(2-p)}\|A\|_{S_2}^{2(q-p)/(2-p)},$$

cf. Lemma 11 of Section 11 with $\theta = \frac{p(2-q)}{q(2-p)}$. □
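The interpolation formula used in the last step is easy to verify numerically on a random matrix. The sketch below is only an illustration; the helper names `schatten` and `interpolation_gap` are hypothetical. It computes Schatten (quasi-)norms from the singular values and returns the difference between the right-hand and left-hand sides, which should be nonnegative for $p \le q \le 2$.

```python
import numpy as np

def schatten(A, q):
    """Schatten-q (quasi-)norm of A computed from its singular values, q > 0."""
    s = np.linalg.svd(A, compute_uv=False)
    return np.sum(s ** q) ** (1.0 / q)

def interpolation_gap(A, p, q):
    """RHS minus LHS of ||A||_q^q <= ||A||_p^{p(2-q)/(2-p)} * ||A||_2^{2(q-p)/(2-p)}."""
    lhs = schatten(A, q) ** q
    rhs = (schatten(A, p) ** (p * (2 - q) / (2 - p))
           * schatten(A, 2) ** (2 * (q - p) / (2 - p)))
    return rhs - lhs

rng = np.random.default_rng(1)
A = rng.standard_normal((6, 4))
gaps = [interpolation_gap(A, 0.5, q) for q in (0.5, 0.8, 1.0, 1.5, 2.0)]
```

At the endpoints $q = p$ and $q = 2$ the two sides coincide, so the corresponding gaps vanish up to rounding.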

10. Proofs of the lemmas. 

Proof of Lemma 1. First, observe that

$$\|M\|_{S_\infty} = \sup_{u\in\mathbb{R}^T:\,|u|_2=1}|Mu|_2 \le \sqrt{m}\max_{1\le j\le m}\ \sup_{u\in\mathbb{R}^T:\,|u|_2=1}|u'\eta_j|,$$

with vectors $\eta_j = N^{-1}\sum_{i=1}^N\xi_i X_{i(j,\cdot)}'$. Consequently, for any $t > 0$,

$$P\Big(\|M\|_{S_\infty} \ge t\sqrt{\frac{m\log m}{N}}\Big) \le P\Big(\sqrt{m}\max_{1\le j\le m}|\eta_j|_2 \ge t\sqrt{\frac{m\log m}{N}}\Big) \le m\max_{1\le j\le m}P\Big(|\eta_j|_2 \ge t\sqrt{\frac{\log m}{N}}\Big).$$

To proceed with the evaluation of the latter probability, we use the following 
concentration bound [Pinelis and Sakhanenko (1985)]. 



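Lemma 8. Let $\zeta_1, \ldots, \zeta_N$ be independent zero mean random variables in
a separable Hilbert space $\mathcal{H}$ such that

$$\sum_{i=1}^N E\|\zeta_i\|_{\mathcal{H}}^l \le \frac{1}{2}\,l!\,B^2L^{l-2}, \qquad l = 2, 3, \ldots, \tag{10.1}$$

with some finite constants $B, L > 0$. Then

$$P\Big(\Big\|\sum_{i=1}^N\zeta_i\Big\|_{\mathcal{H}} \ge x\Big) \le 2\exp\Big(-\frac{x^2}{2B^2 + 2xL}\Big) \qquad \forall x > 0.$$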



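Setting $\zeta_i = \xi_i X_{i(j,\cdot)}$, $\mathcal{H} = \mathbb{R}^T$, note first that, by the Bernstein condition
(8.2),

$$\sum_{i=1}^N E|\zeta_i|_2^l = \sum_{i=1}^N E|\xi_i|^l\,|X_{i(j,\cdot)}|_2^l \le \frac{1}{2}\,l!\,\sigma^2H^{l-2}\Big(\max_{1\le i\le N}|X_{i(j,\cdot)}|_2\Big)^{l-2}\sum_{i=1}^N|X_{i(j,\cdot)}|_2^2 \le \frac{1}{2}\,l!\,B^2L^{l-2},$$

where $B^2 = \sigma^2S_{\mathrm{row}}^2N$ and $L = H_{\mathrm{row}}H$, that is, condition (10.1) is satisfied.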
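Now an application of Lemma 8 yields for any $t > 0$

$$\begin{aligned}
P\Big(|\eta_j|_2 \ge t\sqrt{\frac{\log m}{N}}\Big) &= P\Big(\Big|\frac{1}{N}\sum_{i=1}^N\zeta_i\Big|_2 \ge t\sqrt{\frac{\log m}{N}}\Big)\\
&\le 2\exp\Big(-\frac{N(\log m)t^2}{2B^2 + 2tL\sqrt{N\log m}}\Big)\\
&= 2\exp\Big(-\frac{N(\log m)t^2}{2\sigma^2S_{\mathrm{row}}^2N + 2tL\sqrt{N\log m}}\Big).
\end{aligned}$$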
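Define $t = \sqrt{2D\sigma^2S_{\mathrm{row}}^2} + 2DL\sqrt{\frac{\log m}{N}}$ for some $D > 1$. Then

$$\frac{t^2}{\tilde B + \tilde Lt} \ge D, \quad\text{where } \tilde B = 2\sigma^2S_{\mathrm{row}}^2,\ \tilde L = 2H_{\mathrm{row}}H\sqrt{\frac{\log m}{N}}.$$

With this choice of $t$,

$$P\Big(|\eta_j|_2 \ge t\sqrt{\frac{\log m}{N}}\Big) \le 2\exp(-D\log m) = 2m^{-D}$$

and therefore $P(\|M\|_{S_\infty} \ge \tau_{\mathrm{row}}) \le 2m^{1-D}$, where

$$\tau_{\mathrm{row}} = \Big(\sqrt{2D\sigma^2S_{\mathrm{row}}^2} + 2DH_{\mathrm{row}}H\sqrt{\frac{\log m}{N}}\Big)\sqrt{\frac{m\log m}{N}}.$$

Similarly, using $\|M\|_{S_\infty} = \sup_{|v|_2=1}|v'M|_2$, and assuming (8.6) and (8.7), we
get $P(\|M\|_{S_\infty} \ge \tau_{\mathrm{col}}) \le 2T^{1-D}$, where

$$\tau_{\mathrm{col}} = \Big(\sqrt{2D\sigma^2S_{\mathrm{col}}^2} + 2DH_{\mathrm{col}}H\sqrt{\frac{\log T}{N}}\Big)\sqrt{\frac{T\log T}{N}}. \qquad □$$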
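The only algebraic fact behind the choice of $t$ in the proof of Lemma 1 is that $t = \sqrt{D\tilde B} + D\tilde L$ satisfies $t^2/(\tilde B + \tilde Lt) \ge D$. A quick numerical check of this step is sketched below; the function names and the sample values of $D$, $\tilde B$, $\tilde L$ are illustrative assumptions, not quantities from the paper.

```python
import math

def t_choice(D, B_tilde, L_tilde):
    """t = sqrt(D * B~) + D * L~, the form of t used in the proof of Lemma 1."""
    return math.sqrt(D * B_tilde) + D * L_tilde

def exponent_ratio(t, B_tilde, L_tilde):
    """t^2 / (B~ + L~ t); multiplied by log m it is the exponent of the tail bound."""
    return t * t / (B_tilde + L_tilde * t)

cases = [(1.5, 2.0, 0.1), (3.0, 0.5, 4.0), (10.0, 1e-3, 1e-3), (2.0, 7.0, 0.0)]
ratios = [(D, exponent_ratio(t_choice(D, B, L), B, L)) for D, B, L in cases]
```

In every case the ratio is at least $D$, which is what turns the exponential bound into $2m^{-D}$.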



Proof of Lemma 3. The matrix $M = \frac{1}{N}\sum_{i=1}^N\xi_iX_i$ is a sum of i.i.d.
random matrices. Therefore, part (ii) of the lemma follows by direct application
of the large deviations inequality of Nemirovski (2004).

To prove part (i) of the lemma, we use bounds on maximal eigenvalues 
of subgaussian matrices due to Mendelson, Pajor and Tomczak-Jaegermann 
(2007); see also Vershynin (2007). However, direct application of these bounds 
(based on the overall subgaussianity) does not lead to rates that are accurate 
enough for our purposes. We therefore need to refine the argument using the 
specific structure of the matrices. Note first that 

$$\|M\|_{S_\infty} = \max_{v\in S^{T-1}}|Mv|_2 = \max_{u\in S^{m-1},\,v\in S^{T-1}}u'Mv,$$

where $S^{m-1}$ is the unit sphere in $\mathbb{R}^m$. Therefore, denoting by $\mathcal{M}_m$ and $\mathcal{M}_T$
the minimal 1/2-nets in Euclidean metric on $S^{m-1}$ and $S^{T-1}$, respectively,
we easily get

$$\|M\|_{S_\infty} \le 2\max_{v\in\mathcal{M}_T}|Mv|_2 \le 4\max_{u\in\mathcal{M}_m,\,v\in\mathcal{M}_T}|u'Mv|.$$

Now, $\mathrm{Card}(\mathcal{M}_m) \le 5^m$ [cf. Kolmogorov and Tikhomirov (1959)] so that by
the union bound, for any $\tau > 0$,

$$P(\|M\|_{S_\infty} \ge \tau) \le 5^{m+T}\max_{u\in\mathcal{M}_m,\,v\in\mathcal{M}_T}P(|u'Mv| \ge \tau/4). \tag{10.2}$$
It remains to bound the last probability in (10.2) for fixed $u, v$. Let us fix
some $u \in S^{m-1}$, $v \in S^{T-1}$ and introduce the random event

$$\mathcal{A} = \Big\{\frac{1}{N}\sum_{i=1}^N(u'X_iv)^2 \le \frac{5(m + T)}{N}\Big\}.$$

Note that $E(u'X_iv)^2 = \sum_{k=1}^m\sum_{l=1}^Tu_k^2v_l^2P(X_i = e_k(m)e_l'(T)) = (mT)^{-1}|u|_2^2|v|_2^2 = (mT)^{-1}$,
and consider the zero-mean random variables $\eta_i = (u'X_iv)^2 - E(u'X_iv)^2 = (u'X_iv)^2 - (mT)^{-1}$.
We have $|\eta_i| \le 2\max_i(u'X_iv)^2 \le 2|u|_2^2|v|_2^2 = 2$.
Furthermore,

$$E(\eta_i^2) \le E(u'X_iv)^4 \le \sum_{k=1}^m\sum_{l=1}^Tu_k^4v_l^4P(X_i = e_k(m)e_l'(T)) = (mT)^{-1}\sum_{k=1}^mu_k^4\sum_{l=1}^Tv_l^4 \le (mT)^{-1}.$$
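The two moment computations above can be reproduced exactly by enumerating all $mT$ equally likely positions of the single nonzero entry of $X_i$. This is a minimal sketch assuming uniform sampling of the pair $(k, l)$, as in the display; the function name is illustrative.

```python
import numpy as np

def second_and_fourth_moment(u, v):
    """Exact E(u'Xv)^2 and E(u'Xv)^4 when X = e_k e_l' with (k, l) uniform on the m x T grid."""
    m, T = len(u), len(v)
    w = np.outer(u, v)  # entry (k, l) equals u'Xv = u_k * v_l for X = e_k e_l'
    return np.sum(w ** 2) / (m * T), np.sum(w ** 4) / (m * T)

rng = np.random.default_rng(2)
u = rng.standard_normal(5)
u /= np.linalg.norm(u)
v = rng.standard_normal(7)
v /= np.linalg.norm(v)
second, fourth = second_and_fourth_moment(u, v)
```

For unit vectors the second moment equals $(mT)^{-1}$ and the fourth moment does not exceed it, which is what enters Bernstein's inequality in the next step.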

Therefore, using Bernstein's inequality and the condition $(m + T)/N \ge (mT)^{-1}$
we get

$$P(\mathcal{A}^c) \le 2\exp\Big(-\frac{N(4(m + T)/N)^2}{2(mT)^{-1} + (4/3)(4(m + T)/N)}\Big) \le 2\exp(-2(m + T)), \tag{10.3}$$
where $\mathcal{A}^c$ is the complement of $\mathcal{A}$. We now bound the conditional probability

$$P(|u'Mv| \ge \tau/4 \mid X_1, \ldots, X_N) = P\Big(\Big|\frac{1}{N}\sum_{i=1}^N\xi_i(u'X_iv)\Big| \ge \tau/4\ \Big|\ X_1, \ldots, X_N\Big).$$

Note that conditionally on $X_1, \ldots, X_N$, the $\xi_i(u'X_iv)$ are independent zero-mean
random variables with

$$\sum_{i=1}^NE(|\xi_i(u'X_iv)|^l \mid X_1, \ldots, X_N) \le \sum_{i=1}^NE|\xi_i|^l\,|u'X_iv|^2 \qquad \forall l \ge 2,$$

where we used the fact that $|u'X_iv|^{l-2} \le (|u|_2|v|_2)^{l-2} = 1$ for $l \ge 2$. This and
the Bernstein condition (8.2) yield that, for $(X_1, \ldots, X_N) \in \mathcal{A}$,

$$\sum_{i=1}^NE(|\xi_i(u'X_iv)|^l \mid X_1, \ldots, X_N) \le \frac{l!}{2}B^2H^{l-2}$$

with $B^2 = 5(m + T)\sigma^2$. Therefore, by Lemma 8, for $(X_1, \ldots, X_N) \in \mathcal{A}$ we
have

$$P(|u'Mv| \ge \tau/4 \mid X_1, \ldots, X_N) \le 2\exp\Big(-\frac{N^2\tau^2/16}{10\sigma^2(m + T) + N\tau H/2}\Big). \tag{10.4}$$
For $\tau$ defined in (8.10) the last expression does not exceed $2\exp(-D(m + T))$.
Together with (10.2) and (10.3), this proves the lemma. □

Proof of Lemma 2. We act as in the proof of Lemma 3 but since the
matrices $X_i$ are now deterministic, we do not need to introduce the event
$\mathcal{A}$. By the definition of $\phi_{\max}(1)$,

$$\frac{1}{N}\sum_{i=1}^N(u'X_iv)^2 = |\mathcal{L}(uv')|_2^2 \le \phi_{\max}^2(1)\|uv'\|_{S_2}^2 = \phi_{\max}^2(1)$$

for all $u \in S^{m-1}$, $v \in S^{T-1}$. Hence, $\frac{1}{N}\sum_{i=1}^N\xi_i(u'X_iv)$ is a zero-mean Gaussian
random variable with variance not larger than $\phi_{\max}^2(1)\sigma^2/N$. Therefore,

$$P(|u'Mv| \ge \tau/4) \le 2\exp\Big(-\frac{N\tau^2}{32\phi_{\max}^2(1)\sigma^2}\Big).$$

For $\tau$ as in (8.9) the last expression does not exceed $2\exp(-D(m + T))$.
Combining this with (10.2), we get the lemma. □

Proof of Lemma 4. We proceed again as in the proof of Lemmas 3
and 2. Denote by $\Omega$ the set of pairs $(k, l)$ such that $\{X_1, \ldots, X_N\} = \{e_k(m)e_l'(T), (k, l) \in \Omega\}$
(recall that all $X_i$ are distinct by assumption). Then

$$\sum_{i=1}^N(u'X_iv)^2 = \sum_{(k,l)\in\Omega}u_k^2v_l^2 \le |u|_2^2|v|_2^2 = 1 \tag{10.5}$$

for any $u \in S^{m-1}$, $v \in S^{T-1}$. Hence, under the assumptions of part (i) of the
lemma,

$$P(|u'Mv| \ge \tau/4) \le 2\exp\Big(-\frac{N^2\tau^2}{32\sigma^2}\Big),$$

which does not exceed $2\exp(-D(m + T))$ for $\tau$ defined in (8.12). Combining
this with (10.2) we get part (i) of the lemma. To prove part (ii) we note
that, as in the proof of Lemma 3, $|u'X_iv|^{l-2} \le 1$ for $l \ge 2$. This and (10.5)
yield

$$\sum_{i=1}^NE(|\xi_i(u'X_iv)|^l) \le \frac{l!}{2}B^2H^{l-2} \qquad \forall l \ge 2,$$

with $B^2 = \sigma^2$. Therefore, by Lemma 8, we have

$$P(|u'Mv| \ge \tau/4) \le 2\exp\Big(-\frac{N^2\tau^2/16}{2\sigma^2 + N\tau H/2}\Big),$$

and we complete the proof of (ii) in the same way as in Lemmas 3 and 2.
Part (iii) follows by an application of Theorem 2.1, Tropp (2010), after
replacing every $X_i$ by its self-adjoint dilation [see Paulsen (1986)]. □



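For the proof of Lemma 5 we will need some notation. The $p$th Schatten
class of $M \times M$-matrices is denoted by $S_p^M$, and we write

$$B(S_p^M) = \{A \in \mathbb{R}^{M\times M}: \|A\|_{S_p} \le 1\}$$

for the corresponding closed Schatten-$p$ unit ball in $\mathbb{R}^{M\times M}$. For any pseudo-metric
space $(\mathcal{T}, d)$ and any $\varepsilon > 0$, we define the covering number

$$\mathcal{N}(\mathcal{T}, d, \varepsilon) = \min\Big\{\mathrm{Card}(\mathcal{T}_0): \mathcal{T}_0 \subset \mathcal{T} \text{ and } \inf_{s\in\mathcal{T}_0}d(t, s) \le \varepsilon \text{ for all } t \in \mathcal{T}\Big\}.$$

In other words, $\mathcal{N}(\mathcal{T}, d, \varepsilon)$ is the smallest number of closed balls of radius $\varepsilon$ in
the metric $d$ needed to cover the set $\mathcal{T}$. We will sometimes write $\mathcal{N}(\mathcal{T}, \|\cdot\|, \varepsilon)$
instead of $\mathcal{N}(\mathcal{T}, d, \varepsilon)$ if the metric $d$ is associated with the norm $\|\cdot\|$. The
empirical norm $\|\cdot\|_{2,N}$ corresponds to $d_{2,N}$, that is, for all $A \in \mathbb{R}^{m\times T}$,

$$\|A\|_{2,N}^2 = \frac{1}{N}\sum_{j=1}^N\mathrm{tr}(A'X_j)^2.$$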

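Proof of Lemma 5. Let us first assume that $m = T = M$. Since

$$\sup_{B\in\mathbb{R}^{M\times M}}\frac{|N^{-1/2}\sum_{i=1}^N\xi_i\,\mathrm{tr}(B'X_i)|}{\|B\|_{2,N}^{1-p/(2-p)}\|B\|_{S_p}^{p/(2-p)}} = \sup_{B\in B(S_p^M)}\frac{|N^{-1/2}\sum_{i=1}^N\xi_i\,\mathrm{tr}(B'X_i)|}{\|B\|_{2,N}^{1-p/(2-p)}},$$

the expression on the LHS of (8.14) is not greater than

$$\sqrt{\frac{M}{pN}}\,d_{2,N}(\hat A, A^*)^{1-p/(2-p)}\|\hat A - A^*\|_{S_p}^{p/(2-p)} \times \sup_{B\in B(S_p^M)}\frac{|(M/p)^{(p-2)/(2p)}N^{-1/2}\sum_{i=1}^N\xi_i\,\mathrm{tr}(B'X_i)|}{((M/p)^{(p-2)/(2p)}\|B\|_{2,N})^{1-p/(2-p)}}.$$

Due to the linear dependence in $M$ of the $\varepsilon$-entropies of the quasi-convex
Schatten class embeddings $S_p^M \hookrightarrow S_2^M$ (cf. Corollary 7) and the fact that the
required bound should be uniform in $M$ and in $p$ for $p \searrow 0$, we introduced
an additional weighting by $(M/p)^{(p-2)/(2p)}$. Now define

$$\mathcal{G}_{M,p} = \{A \in \mathbb{R}^{M\times M}: \|A\|_{S_p} \le (M/p)^{(p-2)/(2p)}\}.$$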

By the entropy bound of Corollary 7 and the uniform boundedness condition
(2.3),

$$\log\mathcal{N}(\mathcal{G}_{M,p}, d_{2,N}, \varepsilon) \le \log\mathcal{N}(\mathcal{G}_{M,p}, \sqrt{c_0}\|\cdot\|_{S_2}, \varepsilon) \le p\,\alpha_0(p)(\varepsilon/\sqrt{c_0})^{-2p/(2-p)},$$

whence

$$\int_0^\delta\sqrt{\log\mathcal{N}(\mathcal{G}_{M,p}, d_{2,N}, \varepsilon)}\,d\varepsilon \le c_0^{p/(2(2-p))}\sqrt{p\,\alpha_0(p)}\,\frac{2-p}{2-2p}\,\delta^{(2-2p)/(2-p)}. \tag{10.6}$$

We remark that due to the order specification of $\alpha_0$ in Corollary 7, the
expression

$$\sqrt{p\,\alpha_0(p)}\,\frac{2-p}{2-2p} \tag{10.7}$$

is uniformly bounded as long as $p$ stays uniformly bounded away from 1.
Note that for $p = 1$ the entropy integral on the LHS in (10.6) does not
converge.

Claim 1. For any $q \in (0, 1)$, there exist constants $c(q)$ and $c'(q)$, such
that for all $0 < p \le q$, all $0 < \delta \le \sqrt{c_0}$ and uniformly in $M$ and $N$,

$$P\Big(\sup_{B\in\mathcal{G}_{M,p}:\,\|B\|_{2,N}\le\delta}\Big|\frac{1}{\sqrt{N}}\sum_{j=1}^N\xi_j\,\mathrm{tr}(B'X_j)\Big| \ge T\Big) \le c(q)\exp\Big(-\frac{T^2}{c(q)^2\delta^2}\Big) \tag{10.8}$$

for all $T \ge c'(q)\delta^{1-p/(2-p)}$.

Proof. The bound is essentially stated in van de Geer (2000) as Lemma
3.2 [further referred to as VG(00)]. The constant in VG(00) depends neither
on the $\|\cdot\|_{2,N}$-diameter of the function class nor on the function class itself
and is valid, in particular, for $\varepsilon = 0$, in the notation of VG(00). The
uniformity in $0 < p \le q$ follows from the uniform boundedness of (10.7)
over $p \in (0, q]$. The required case corresponds to $K = \infty$ in the notation
of VG(00). Its proof follows by taking $\varepsilon = 0$ and applying the theorem of
monotone convergence as $K \to \infty$, since the RHS of the inequality is independent
of $K$. □

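Claim 2. For any $q \in (0, 1)$, there exists a constant $C(q)$ such that for
any $0 < p \le q$

$$P\Big(\sup_{B\in\mathcal{G}_{M,p}}\frac{|(1/\sqrt{N})\sum_{j=1}^N\xi_j\,\mathrm{tr}(B'X_j)|}{\|B\|_{2,N}^{1-p/(2-p)}} \ge T\Big) \le C(q)\exp(-T^2M/C(q)^2) \tag{10.9}$$

for all $T \ge C(q)$.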

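Proof. First, observe that

$$\sup_{A\in\mathcal{G}_{M,p}}\|A\|_{2,N} \le \sqrt{c_0}\sup_{A\in\mathcal{G}_{M,p}}\|A\|_{S_2} \le \sqrt{c_0}(M/p)^{(p-2)/(2p)}\sup_{A\in B(S_2^M)}\|A\|_{S_2} = \sqrt{c_0}(M/p)^{(p-2)/(2p)},$$

where the last inequality follows from $B(S_p^M) \subset B(S_2^M)$. Define the decomposition
of $\mathcal{G}_{M,p}$

$$\mathcal{G}_{M,p}^{(k)} = \{A \in \mathcal{G}_{M,p}: (1/2)^k\sqrt{c_0}(M/p)^{(p-2)/(2p)} \le \|A\|_{2,N} \le (1/2)^{k-1}\sqrt{c_0}(M/p)^{(p-2)/(2p)}\}, \qquad k = 1, 2, \ldots$$

Then by peeling-off the class $\mathcal{G}_{M,p}$, we obtain together with Claim 1 for all
$T \ge c'(q)$

$$\begin{aligned}
&P\Big(\sup_{B\in\mathcal{G}_{M,p}}\frac{|(1/\sqrt{N})\sum_{j=1}^N\xi_j\,\mathrm{tr}(B'X_j)|}{\|B\|_{2,N}^{1-p/(2-p)}} \ge T\Big)\\
&\qquad\le \sum_{k=1}^\infty P\Big(\sup_{B\in\mathcal{G}_{M,p}^{(k)}}\Big|\frac{1}{\sqrt{N}}\sum_{j=1}^N\xi_j\,\mathrm{tr}(B'X_j)\Big| \ge T\big((1/2)^k\sqrt{c_0}(M/p)^{(p-2)/(2p)}\big)^{1-p/(2-p)}\Big)\\
&\qquad\le \sum_{k=1}^\infty c(q)\exp\Big(-\frac{T^2(1/2)^2\big((1/2)^k\sqrt{c_0}(M/p)^{(p-2)/(2p)}\big)^{-2p/(2-p)}}{c(q)^2}\Big)\\
&\qquad\le \sum_{k=1}^\infty c(q)\exp\Big(-\frac{T^2M2^{k(2p)/(2-p)}C_0(q)}{4p\,c(q)^2}\Big)
\end{aligned} \tag{10.10}$$

with the definition

$$C_0(q) = \inf_{0<p\le q}c_0^{-p/(2-p)} > 0.$$

It remains to note that the last sum in (10.10) is bounded by $C(q)\exp(-T^2M/C(q)^2)$
uniformly in $0 < p \le q$ whenever $T \ge C(q)$ for some suitable constant
$C(q)$. This follows from the fact that

$$\sum_{k=1}^\infty\exp(-p^{-1}2^{k(2p)/(2-p)}) \le \sum_{k=1}^\infty\frac{1}{p^{-1}2^{k(2p)/(2-p)} + 1} \le \frac{p}{1 - (1/2)^{(2p)/(2-p)}},$$
and the latter expression is bounded uniformly in $0 < p \le q$. □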



In particular, the result reveals that the LHS of (8.14) is bounded by

$$d_{2,N}(\hat A, A^*)^{1-p/(2-p)}\|\hat A - A^*\|_{S_p}^{p/(2-p)}\sqrt{\vartheta/p}\,\Big(\frac{M}{N}\Big)^{1/2} \tag{10.11}$$
with probability at least $1 - C\exp(-\vartheta M/C^2)$ for any $\sqrt{\vartheta} \ge C(q)$.

We now use the following simple consequence of the concavity of the 
logarithm which is stated, for instance, in Tsybakov and van de Geer (2005) 
(Lemma 5). 

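Lemma 9. For any positive $v$, $t$ and any $\kappa \ge 1$, $\delta > 0$ we have

$$vt^{1/(2\kappa)} \le (\delta/2)t + c_\kappa\delta^{-1/(2\kappa-1)}v^{2\kappa/(2\kappa-1)},$$

where $c_\kappa = (2\kappa - 1)(2\kappa)^{-1}\kappa^{-1/(2\kappa-1)}$.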
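The inequality of Lemma 9 is a Young-type bound, and the constant $c_\kappa = (2\kappa-1)(2\kappa)^{-1}\kappa^{-1/(2\kappa-1)}$, as read here from the statement, is the value obtained by maximizing $vt^{1/(2\kappa)} - (\delta/2)t$ over $t > 0$. The sketch below checks this numerically on a grid; the function names are illustrative, and the maximizer formula in `t_star` is our own calculus step rather than something quoted from the cited lemma.

```python
def c_kappa(kappa):
    """Constant (2k - 1) / (2k) * k^(-1/(2k - 1)) from the statement of the lemma."""
    return (2 * kappa - 1) / (2 * kappa) * kappa ** (-1.0 / (2 * kappa - 1))

def young_gap(t, v, kappa, delta):
    """RHS minus LHS of v t^(1/(2k)) <= (delta/2) t + c_k delta^(-1/(2k-1)) v^(2k/(2k-1))."""
    rhs = (delta / 2) * t + c_kappa(kappa) * delta ** (-1.0 / (2 * kappa - 1)) \
        * v ** (2 * kappa / (2 * kappa - 1))
    return rhs - v * t ** (1.0 / (2 * kappa))

def t_star(v, kappa, delta):
    """Maximizer of v t^(1/(2k)) - (delta/2) t over t > 0 (first-order condition)."""
    return (v / (kappa * delta)) ** (2 * kappa / (2 * kappa - 1))

grid = [0.01, 0.1, 0.5, 1.0, 3.0, 10.0]
gaps = [young_gap(t, v, k, d) for t in grid for v in grid for k in (1.0, 1.5, 4.0) for d in (0.1, 1.0)]
tight = young_gap(t_star(2.0, 1.5, 0.7), 2.0, 1.5, 0.7)
```

With $\kappa = (2-p)/(2-2p)$ the exponents become $1/(2\kappa) = (1-p)/(2-p)$ and $2\kappa/(2\kappa-1) = 2-p$, which is how the lemma is applied to (10.11).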



Taking in Lemma 9

$$t = d_{2,N}(\hat A, A^*)^2, \qquad v = \|\hat A - A^*\|_{S_p}^{p/(2-p)}\sqrt{\vartheta/p}\,\Big(\frac{M}{N}\Big)^{1/2}$$

and $\kappa = (2 - p)/(2 - 2p)$ shows that for any $\delta > 0$

$$(10.11) \le (\delta/2)d_{2,N}(\hat A, A^*)^2 + \tau\delta^{p-1}\|\hat A - A^*\|_{S_p}^p$$

with probability at least $1 - C\exp(-\vartheta M/C^2)$.

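The case $m \ne T$ can be deduced from the above result by the following
observation. For any matrix $B = (b_{ij}) \in \mathbb{R}^{m\times T}$, define the extension $\tilde B = (\tilde b_{ij}) \in \mathbb{R}^{M\times M}$
with $M = \max(m, T)$ as follows: $\tilde b_{ij} = b_{ij}$ for $1 \le i \le m$, $1 \le j \le T$
and $\tilde b_{ij} = 0$ otherwise. Then one easily checks that $\|B\|_{S_p} = \|\tilde B\|_{S_p}$ for
all $p \in [0, \infty]$. Furthermore, $\mathrm{tr}(B'X_i) = \mathrm{tr}(\tilde B'\tilde X_i)$ and

$$\sup_{\tilde A\in\mathbb{R}^{M\times M}}\frac{N^{-1}\sum_{i=1}^N\mathrm{tr}(\tilde A'\tilde X_i)^2}{\|\tilde A\|_{S_2}^2} = \sup_{A\in\mathbb{R}^{m\times T}}\frac{\|A\|_{2,N}^2}{\|A\|_{S_2}^2}.$$

Consequently, the result follows now from the already established proof for
the case $m = T$. □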
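The zero-padding device used for $m \ne T$ can be checked in a few lines. The sketch below is illustrative only, with a hypothetical helper `pad_to_square`: it embeds an $m \times T$ matrix into the top-left corner of an $M \times M$ zero matrix and confirms that the nonzero singular values, and hence all Schatten (quasi-)norms, as well as the trace pairings $\mathrm{tr}(B'X)$ are unchanged.

```python
import numpy as np

def pad_to_square(B):
    """Embed an m x T matrix into the top-left block of an M x M zero matrix, M = max(m, T)."""
    m, T = B.shape
    M = max(m, T)
    out = np.zeros((M, M))
    out[:m, :T] = B
    return out

rng = np.random.default_rng(3)
B = rng.standard_normal((3, 5))
X = rng.standard_normal((3, 5))
sv = np.linalg.svd(B, compute_uv=False)
sv_pad = np.linalg.svd(pad_to_square(B), compute_uv=False)
trace_plain = np.trace(B.T @ X)
trace_padded = np.trace(pad_to_square(B).T @ pad_to_square(X))
```

The padded matrix has $M - \min(m, T)$ additional singular values, all equal to zero, so sums of $p$th powers of singular values agree for every $p > 0$.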



11. Entropy numbers for quasi-convex Schatten class embeddings. Here
we derive bounds for the $k$th entropy numbers of the embeddings $S_p^M \hookrightarrow S_2^M$
for $0 < p < 1$, where $S_p^M$ denotes the $p$th Schatten class of real $M \times M$-matrices.
Corresponding results for the $\ell_p^M \hookrightarrow \ell_r^M$-embeddings are given
first by Edmunds and Triebel (1989) but their proof does not carry over
to the Schatten spaces. Pajor (1998) provides bounds for the $S_p^M \hookrightarrow S_2^M$-embeddings
in the convex case, $p \ge 1$. His approach is based on the trace
duality (Hölder's inequality for $p^{-1} + q^{-1} = 1$) and the geometric formulation
of Sudakov's minoration

$$\varepsilon\sqrt{\log\mathcal{N}(A, |\cdot|_2, \varepsilon)} \le cE\sup_{t\in A}\langle G, t\rangle$$
for some positive constant $c$, with a $d$-dimensional standard Gaussian vector
$G$ and an arbitrary subset $A$ of $\mathbb{R}^d$. Here $|\cdot|_2$ is the Euclidean norm in $\mathbb{R}^d$
and $\langle\cdot,\cdot\rangle$ is the corresponding scalar product. Guedon and Litvak (2000)
derive a slightly sharper bound for the $\ell_p^M \hookrightarrow \ell_r^M$-embeddings than Edmunds
and Triebel (1989) with a different technique. In addition, they prove lower
bounds. We adjust their ideas concerning finite $\ell_p$ spaces to the nonconvex
Schatten spaces.

We denote by $e_k(id_{p,r}^M)$ the $k$th entropy number of the embedding $S_p^M \hookrightarrow S_r^M$
for $0 < p < r \le \infty$, that is, the infimum of all $\varepsilon > 0$ such that there exist
$2^{k-1}$ balls in $S_r^M$ of radius $\varepsilon$ that cover $B(S_p^M)$. For the general definition of
$k$th entropy numbers $e_k(T: F \to E)$ of bounded linear operators $T$ between
quasi-Banach spaces $F$ and $E$, we refer to Edmunds and Triebel (1996).

Recall that a homogeneous nonnegative functional $\|\cdot\|$ is called $C$-quasi-norm,
if it satisfies for all $x, y$ the inequality $\|x + y\| \le C\max(\|x\|, \|y\|)$.
Finally, any $p$-norm is a $C$-quasi-norm with $C = 2^{1/p}$ [cf., e.g., Edmunds
and Triebel (1996), page 2]. We will use the following lemma.
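The quasi-triangle inequality with constant $C = 2^{1/p}$ can be observed numerically for the Schatten-$p$ quasi-norm. The following sketch is an illustration on random matrices, not a proof; the helper names are hypothetical. It relies on the $p$-triangle inequality $\|A + B\|_{S_p}^p \le \|A\|_{S_p}^p + \|B\|_{S_p}^p$ for $0 < p \le 1$, which gives $\|A + B\|_{S_p} \le 2^{1/p}\max(\|A\|_{S_p}, \|B\|_{S_p})$.

```python
import numpy as np

def schatten_quasi_norm(A, p):
    """Schatten-p quasi-norm: l_p quasi-norm of the singular values of A."""
    s = np.linalg.svd(A, compute_uv=False)
    return np.sum(s ** p) ** (1.0 / p)

def quasi_triangle_gap(A, B, p):
    """2^(1/p) * max(||A||, ||B||) - ||A + B|| for the Schatten-p quasi-norm."""
    bound = 2 ** (1.0 / p) * max(schatten_quasi_norm(A, p), schatten_quasi_norm(B, p))
    return bound - schatten_quasi_norm(A + B, p)

rng = np.random.default_rng(4)
gaps = []
for p in (0.3, 0.5, 0.9, 1.0):
    for _ in range(20):
        A = rng.standard_normal((4, 4))
        B = rng.standard_normal((4, 4))
        gaps.append(quasi_triangle_gap(A, B, p))
```

The constant deteriorates as $p \searrow 0$, which is one reason why the constants in the entropy bounds of this section are tracked explicitly in $p$.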

Lemma 10 [Guedon and Litvak (2000)]. Assume that $\|\cdot\|_i$ are symmetric
$C_i$-quasi-norms on $\mathbb{R}^n$ for $i = 0, 1$, and for some $\theta \in (0, 1)$, $\|\cdot\|_\theta$ is a quasi-norm
on $\mathbb{R}^n$ such that $\|x\|_\theta \le \|x\|_0^\theta\|x\|_1^{1-\theta}$ for all $x \in \mathbb{R}^n$. Then for any
quasi-normed space $F$, any linear operator $T: F \to \mathbb{R}^n$, and all integers $k$
and $m$, we have

$$e_{m+k-1}(T: F \to E_\theta) \le (C_0e_m(T: F \to E_0))^\theta(C_1e_k(T: F \to E_1))^{1-\theta},$$
where $E_t$ stands for $\mathbb{R}^n$ equipped with quasi-norm $\|\cdot\|_t$, $t \in \{0, \theta, 1\}$.

Guedon and Litvak (2000) did not specify the notion of symmetry they
used. So we have to clarify that here a (quasi-)norm $\|\cdot\|$ is called symmetric
if $(\mathbb{R}^n, \|\cdot\|)$ is isometrically isomorphic to a symmetrically (quasi-)normed
operator ideal. This includes the diagonal operator spaces (finite $\ell_p$) as well
as the Schatten spaces. The proof of Lemma 10 follows the lines of Pietsch
(1980), Proposition 12.1.12, replacing the triangle inequality by the quasi-triangle
inequality. Recall that the Schatten classes $S_p$ form interpolation
couples like their commutative analogs $\ell_p$.

Lemma 11 (Interpolation inequality). For $0 < p < q < r \le \infty$, let $\theta \in [0, 1]$
be such that

$$\frac{\theta}{p} + \frac{1-\theta}{r} = \frac{1}{q}.$$

Then, for all $A \in \mathbb{R}^{m\times T}$,

$$\|A\|_{S_q} \le \|A\|_{S_p}^\theta\|A\|_{S_r}^{1-\theta}.$$
Proof is immediate in view of the inequalities

$$\sum_ja_j^q \le \Big(\sum_ja_j^p\Big)^{\theta q/p}\Big(\sum_ja_j^r\Big)^{(1-\theta)q/r}$$

valid for any nonnegative $a_j$'s.

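Proposition 1 (Entropy numbers). Let $0 < p < 1$, $p < r \le \infty$. Then
there exists an absolute constant $\beta$ independent of $p$ and $r$, such that for all
integers $k$ and $M$ we have

$$e_k(id_{p,r}^M) \le \min\Big\{1, \alpha(\beta, p, r)\Big(\frac{M}{k}\Big)^{1/p-1/r}\Big\}$$

with

$$\alpha(\beta, p, r) \le 2^{1+1/r}\Big(\frac{\beta}{p}\Big)^{1/p-1/r}\Big(\frac{1}{1-p}\Big)^{(1/p-1)(1/p-1/r)}.$$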

Proof. The fact that $e_k(id_{p,r}^M)$ is bounded by 1 is obvious, since $B(S_p^M) \subset B(S_r^M)$.
Consider the other case. We start with $r = \infty$ and then extend
the result to $r < \infty$ by interpolation. Fix some number $L > M$ and let
$D = D(M, L, p)$ be the smallest constant which satisfies, for all $1 \le k \le L$,

$$e_k(id_{p,\infty}^M) \le D\Big(\frac{M}{k}\Big)^{1/p}. \tag{11.1}$$

Let us show that $\alpha = \sup_{M,L}D(M, L, p)$ is finite. Since $\|\cdot\|_{S_p}$, $p < 1$, can be
viewed as a quasi-norm on $\mathbb{R}^{M^2}$ (isomorphic to $\mathbb{R}^{M\times M}$), Lemma 10 applies
with $F = E_0 = S_p^M$, $E_1 = S_\infty^M$, $\theta = p$, $E_\theta = S_1^M$ and $m = 1$. This gives

$$e_k(id_{p,1}^M) \le 4(e_k(id_{p,\infty}^M))^{1-p}. \tag{11.2}$$

Here the factor 4 follows from the relations $C_1 = 2$ and $C_0^\theta \le 2$. Now, (11.2)
and the factorization theorem for entropy numbers of bounded linear operators
between quasi-Banach spaces [see, e.g., Edmunds and Triebel (1996),
page 8], with factorization via $S_1^M$, leads to the bound

$$e_k(id_{p,\infty}^M) \le e_{[(1-p)k]}(id_{p,1}^M)\,e_{[pk]}(id_{1,\infty}^M) \le 4(e_{[(1-p)k]}(id_{p,\infty}^M))^{1-p}\,e_{[pk]}(id_{1,\infty}^M), \tag{11.3}$$

where for any $x \in (0, \infty)$, $[x]$ denotes the smallest integer which is larger or
equal to $x$. Proposition 5 of Pajor (1998) entails $\log\mathcal{N}(B(S_1^M), \|\cdot\|_{S_\infty}, \varepsilon) \le cM/\varepsilon$,
$\forall\varepsilon > 0$, and hence

$$e_k(id_{1,\infty}^M) \le c'M/k \tag{11.4}$$
with constants $c$ and $c'$ independent of $M$, $\varepsilon$ and $k$. Note that, in contrast to
the $\ell_1^M \hookrightarrow \ell_\infty^M$-embedding, for which the $k$th entropy numbers are bounded
by $c''k^{-1}\log(1 + M/k)$ with some $c'' > 0$ and $\log_2M \le k \le M$ [see, e.g.,
Edmunds and Triebel (1996), page 98], we have in (11.4) not a logarithmic
but linear dependence of $M$ in the upper bound. Plugging (11.1) and (11.4)
into (11.3) yields

$$e_k(id_{p,\infty}^M) \le 4\Big(D\Big(\frac{M}{(1-p)k}\Big)^{1/p}\Big)^{1-p}\frac{c'M}{pk} = \frac{4c'}{p}\Big(\frac{1}{1-p}\Big)^{(1-p)/p}D^{1-p}\Big(\frac{M}{k}\Big)^{1/p}.$$

Thus, by definition of $D$,

$$D^p \le \frac{4c'}{p}\Big(\frac{1}{1-p}\Big)^{(1-p)/p},$$

which shows that $D$ is uniformly bounded in $M$ and $L$. This proves the
proposition for $r = \infty$.

Consider now the case $r < \infty$. In view of Lemma 11 with $\theta = p/r$, we
can apply Lemma 10 with $F = E_0 = S_p^M$, $E_1 = S_\infty^M$, $\theta = p/r$, $E_\theta = S_r^M$ and
$m = 1$. This yields

$$e_k(id_{p,r}^M) \le 2^{1+1/r}(e_k(id_{p,\infty}^M))^{1-p/r}.$$

Corollary 7. For any $p \in (0, 1)$, there exists a positive constant $\alpha_0(p)$
such that for all integers $M \ge 1$ and any $\varepsilon \in (0, 1]$,

$$\log\mathcal{N}(B(S_p^M), \|\cdot\|_{S_2}, \varepsilon) \le \alpha_0(p)M\varepsilon^{-2p/(2-p)}.$$
Moreover, $\alpha_0(p) = O(1/p)$ for $p \searrow 0$.

Proof. The result follows by transforming the entropy number bound
of Proposition 1 into an entropy bound. Specification of the constant in
Proposition 1 yields

$$\alpha_0(p) = O\Big(\frac{1}{p}\Big(1 + \frac{p}{1-p}\Big)^{(1-p)/p}\Big) = O(1/p)$$

as $p \searrow 0$. □

Acknowledgment. We are grateful to Alain Pajor for pointing out refer- 
ence Guedon and Litvak (2000). 